[ 
https://issues.apache.org/jira/browse/AURORA-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059579#comment-14059579
 ] 

Joe Smith commented on AURORA-176:
----------------------------------

I contend this priority should be bumped up: we're misclassifying a set of 
problems, which can lead to alert fatigue and wasted debugging time (when we 
arguably should be showing FAILED because the user no longer exists).
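
To make the distinction concrete, here is a minimal sketch of the kind of
check that could separate "user is gone" from a genuinely lost runner. The
names user_exists and state_for_dead_runner are hypothetical and are not
part of Thermos; only pwd.getpwnam (standard library, Unix only) is real.

```python
import pwd


def user_exists(username, lookup=pwd.getpwnam):
    """Return True if `username` resolves to a passwd entry on this machine.

    `lookup` is injectable only so the check can be exercised without
    touching the real user database.
    """
    try:
        lookup(username)
    except KeyError:
        # pwd.getpwnam raises KeyError when the name is not found.
        return False
    return True


def state_for_dead_runner(username, lookup=pwd.getpwnam):
    """Hypothetical policy: a missing user is a task FAILED, not LOST."""
    return "LOST" if user_exists(username, lookup) else "FAILED"
```

Whether the executor is the right place for such a check is an open
question; the sketch only illustrates the classification being argued for.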

> more gracefully handle cases where user does not exist on machine
> -----------------------------------------------------------------
>
>                 Key: AURORA-176
>                 URL: https://issues.apache.org/jira/browse/AURORA-176
>             Project: Aurora
>          Issue Type: Task
>          Components: Thermos
>            Reporter: brian wickman
>            Priority: Minor
>
> Tasks are going LOST for this reason:
> {noformat}
> Initialization of task runner failed: Could not construct sandbox: 
> 'getpwnam(): name not found: zmanji'
> {noformat}
> I think everything is actually behaving correctly but this should probably be 
> propagated as a task FAILED rather than LOST.
> From Jon B:
> As far as the specific case described here (failure during task runner 
> initialization), this is actually done - it now results in FAILED:
> {noformat}
> 2 mins ago - FAILED : Initialization of task runner failed: Could not 
> construct sandbox: 'getpwnam(): name not found: cosmingheorghe'
> {noformat}
> However, a task can still go LOST when its user no longer exists and the 
> runner terminates abnormally (after the sandbox is set up and the task is 
> already running):
> {noformat}
> D1024 21:17:11.360357 52726 runner.py:748] runnable: application
> D1024 21:17:11.360486 52726 runner.py:750] waiting: 
> D1024 21:17:11.361073 52726 runner.py:647] _set_process_status(application <= 
> WAITING, seq=4[auto])
> D1024 21:17:11.361279 52726 ckpt.py:367] Running state machine for 
> process=application/seq=4
> D1024 21:17:11.361418 52726 runner.py:225] _on_process_transition: 
> ProcessStatus(seq=4, process=u'application', start_time=None, 
> coordinator_pid=None, pid=None, return_code=None, state=0, stop_time=None, 
> fork_time=None)
> D1024 21:17:11.428257 52726 runner.py:90] Process on_waiting 
> ProcessStatus(seq=4, process=u'application', start_time=None, 
> coordinator_pid=None, pid=None, return_code=None, state=0, stop_time=None, 
> fork_time=None)
> E1024 21:17:11.429981 52726 runner.py:545] Caught exception in 
> self.control(): Unable to get pwent information!
> E1024 21:17:11.432827 52726 runner.py:546]   Traceback (most recent call 
> last):
>   File "twitter/thermos/runner/runner.py", line 543, in control
>     yield
>   File "twitter/thermos/runner/runner.py", line 831, in run
>     self._run()
>   File "twitter/thermos/runner/runner.py", line 839, in _run
>     iteration_wait = runner.run()
>   File "twitter/thermos/runner/runner.py", line 278, in run
>     launched = self.runner._run_plan(self.runner._regular_plan)
>   File "twitter/thermos/runner/runner.py", line 764, in _run_plan
>     self._set_process_status(process_name, ProcessState.WAITING)
>   File "twitter/thermos/runner/runner.py", line 650, in _set_process_status
>     self._dispatcher.dispatch(self._state, runner_ckpt, self._recovery)
>   File "twitter/thermos/base/ckpt.py", line 372, in dispatch
>     self._run_process_dispatch(process_update.state, process_update)
>   File "twitter/thermos/base/ckpt.py", line 200, in _run_process_dispatch
>     getattr(handler, handler_function)(process_update)
>   File "twitter/thermos/runner/runner.py", line 93, in on_waiting
>     process_update.process, process_update.seq + 1))
>   File "twitter/thermos/runner/runner.py", line 675, in 
> _task_process_from_process_name
>     fork=close_ckpt_and_fork)
>   File "twitter/thermos/runner/process.py", line 287, in __init__
>     ProcessBase.__init__(self, *args, **kw)
>   File "twitter/thermos/runner/process.py", line 93, in __init__
>     user, current_user = self._getpwuid() # may raise self.UnknownUserError
>   File "twitter/thermos/runner/process.py", line 214, in _getpwuid
>     raise self.UnknownUserError('Unable to get pwent information!')
> UnknownUserError: Unable to get pwent information!
> and then the executor marks it as LOST:
> I1024 21:17:11.947812 52673 status_manager.py:69] Executor polling thread 
> detected termination condition.
> I1024 21:17:11.948065 52673 task_runner_wrapper.py:184] Runner is dead, 
> skipping kill.
> I1024 21:18:11.114897 52673 status_manager.py:109] Waiting for terminal 
> state, current state: ACTIVE
> ...
> I1024 21:18:11.616302 52673 status_manager.py:109] Waiting for terminal 
> state, current state: ACTIVE
> I1024 21:18:12.117630 52673 status_manager.py:125] State we've accepted: 
> Thermos(ACTIVE) / Failure: None
> E1024 21:18:12.117769 52673 status_manager.py:129] Runner is dead but task 
> state unexpectedly ACTIVE!
> ...
> D1024 21:18:12.664916 52673 ckpt.py:336] Flipping task state from FINALIZING 
> to KILLED
> D1024 21:18:12.665034 52673 runner.py:229] _on_task_transition: 
> TaskStatus(state=3, runner_uid=0, runner_pid=52673, 
> timestamp_ms=1382649492664)
> D1024 21:18:12.737503 52673 runner.py:188] Task on_killed(TaskStatus(state=3, 
> runner_uid=0, runner_pid=52673, timestamp_ms=1382649492664))
> I1024 21:18:12.737726 52673 helper.py:125]   Coordinator stage_twemcache 
> [pid: 52758] completed.
> I1024 21:18:12.737891 52673 helper.py:136]   Process stage_twemcache [pid: 
> 52759] completed.
> I1024 21:18:12.738014 52673 runner.py:903] Transitioning application to LOST
> D1024 21:18:12.738142 52673 helper.py:204] 
> TaskRunnerHelper.kill_process(stage_twemcache)
> I1024 21:18:12.738306 52673 helper.py:125]   Coordinator stage_twemcache 
> [pid: 52758] completed.
> I1024 21:18:12.738466 52673 helper.py:136]   Process stage_twemcache [pid: 
> 52759] completed.
> D1024 21:18:12.738584 52673 helper.py:212]    => SIGKILL coordinator group 
> 52758
> I1024 21:18:12.738967 52673 status_manager.py:150] Sending terminal state 
> update: TASK_LOST
> {noformat}
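
One possible shape for a fix, sketched under assumptions: catch the
unknown-user error where the runner loop runs and record it as a task
failure, so the executor has something other than "runner is dead" to
report. UnknownUserError below is a stand-in class for the one in the
traceback, and run_and_classify is a hypothetical helper, not the actual
Thermos control flow.

```python
class UnknownUserError(Exception):
    """Stand-in for the error raised by the runner in the traceback above."""


def run_and_classify(run_once):
    """Run one runner iteration and return a (state, reason) pair.

    A missing user is reported as FAILED with an explicit reason; any other
    unexpected exception is still treated as an abnormal termination (LOST).
    """
    try:
        run_once()
    except UnknownUserError as e:
        return ("FAILED", "Task user no longer exists: %s" % e)
    except Exception as e:
        return ("LOST", "Runner terminated abnormally: %s" % e)
    return ("ACTIVE", None)


def raise_unknown_user():
    """Simulate the failure seen in the log above."""
    raise UnknownUserError("Unable to get pwent information!")


state, reason = run_and_classify(raise_unknown_user)
print(state, "-", reason)
```

The point of the sketch is only that the failure reason is known at the
moment the exception is raised, so it does not have to be inferred later
from a dead runner.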



--
This message was sent by Atlassian JIRA
(v6.2#6252)
