[
https://issues.apache.org/jira/browse/AURORA-176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Lambert updated AURORA-176:
---------------------------------
Sprint: Aurora Q4 Sprint 1
> more gracefully handle cases where user does not exist on machine
> -----------------------------------------------------------------
>
> Key: AURORA-176
> URL: https://issues.apache.org/jira/browse/AURORA-176
> Project: Aurora
> Issue Type: Task
> Components: Thermos
> Reporter: brian wickman
> Assignee: brian wickman
> Priority: Critical
>
> Tasks are going LOST for this reason:
> {noformat}
> Initialization of task runner failed: Could not construct sandbox:
> 'getpwnam(): name not found: zmanji'
> {noformat}
> I think everything is actually behaving correctly, but this should probably be
> propagated as task FAILED rather than LOST.
> From Jon B:
> As far as the specific case described here (failure during task runner
> initialization), this is actually done - it now results in FAILED:
> {noformat}
> 2 mins ago - FAILED : Initialization of task runner failed: Could not
> construct sandbox: 'getpwnam(): name not found: cosmingheorghe'
> {noformat}
> However, a task can still go LOST if the user no longer exists and the
> runner terminates abnormally (after the sandbox is set up and the task is
> already running):
> {noformat}
> D1024 21:17:11.360357 52726 runner.py:748] runnable: application
> D1024 21:17:11.360486 52726 runner.py:750] waiting:
> D1024 21:17:11.361073 52726 runner.py:647] _set_process_status(application <=
> WAITING, seq=4[auto])
> D1024 21:17:11.361279 52726 ckpt.py:367] Running state machine for
> process=application/seq=4
> D1024 21:17:11.361418 52726 runner.py:225] _on_process_transition:
> ProcessStatus(seq=4, process=u'application', start_time=None,
> coordinator_pid=None, pid=None, return_code=None, state=0, stop_time=None,
> fork_time=None)
> D1024 21:17:11.428257 52726 runner.py:90] Process on_waiting
> ProcessStatus(seq=4, process=u'application', start_time=None,
> coordinator_pid=None, pid=None, return_code=None, state=0, stop_time=None,
> fork_time=None)
> E1024 21:17:11.429981 52726 runner.py:545] Caught exception in
> self.control(): Unable to get pwent information!
> E1024 21:17:11.432827 52726 runner.py:546] Traceback (most recent call
> last):
> File "twitter/thermos/runner/runner.py", line 543, in control
> yield
> File "twitter/thermos/runner/runner.py", line 831, in run
> self._run()
> File "twitter/thermos/runner/runner.py", line 839, in _run
> iteration_wait = runner.run()
> File "twitter/thermos/runner/runner.py", line 278, in run
> launched = self.runner._run_plan(self.runner._regular_plan)
> File "twitter/thermos/runner/runner.py", line 764, in _run_plan
> self._set_process_status(process_name, ProcessState.WAITING)
> File "twitter/thermos/runner/runner.py", line 650, in _set_process_status
> self._dispatcher.dispatch(self._state, runner_ckpt, self._recovery)
> File "twitter/thermos/base/ckpt.py", line 372, in dispatch
> self._run_process_dispatch(process_update.state, process_update)
> File "twitter/thermos/base/ckpt.py", line 200, in _run_process_dispatch
> getattr(handler, handler_function)(process_update)
> File "twitter/thermos/runner/runner.py", line 93, in on_waiting
> process_update.process, process_update.seq + 1))
> File "twitter/thermos/runner/runner.py", line 675, in
> _task_process_from_process_name
> fork=close_ckpt_and_fork)
> File "twitter/thermos/runner/process.py", line 287, in __init__
> ProcessBase.__init__(self, *args, **kw)
> File "twitter/thermos/runner/process.py", line 93, in __init__
> user, current_user = self._getpwuid() # may raise self.UnknownUserError
> File "twitter/thermos/runner/process.py", line 214, in _getpwuid
> raise self.UnknownUserError('Unable to get pwent information!')
> UnknownUserError: Unable to get pwent information!
> {noformat}
> and then the executor marks it as LOST:
> {noformat}
> I1024 21:17:11.947812 52673 status_manager.py:69] Executor polling thread
> detected termination condition.
> I1024 21:17:11.948065 52673 task_runner_wrapper.py:184] Runner is dead,
> skipping kill.
> I1024 21:18:11.114897 52673 status_manager.py:109] Waiting for terminal
> state, current state: ACTIVE
> ...
> I1024 21:18:11.616302 52673 status_manager.py:109] Waiting for terminal
> state, current state: ACTIVE
> I1024 21:18:12.117630 52673 status_manager.py:125] State we've accepted:
> Thermos(ACTIVE) / Failure: None
> E1024 21:18:12.117769 52673 status_manager.py:129] Runner is dead but task
> state unexpectedly ACTIVE!
> ...
> D1024 21:18:12.664916 52673 ckpt.py:336] Flipping task state from FINALIZING
> to KILLED
> D1024 21:18:12.665034 52673 runner.py:229] _on_task_transition:
> TaskStatus(state=3, runner_uid=0, runner_pid=52673,
> timestamp_ms=1382649492664)
> D1024 21:18:12.737503 52673 runner.py:188] Task on_killed(TaskStatus(state=3,
> runner_uid=0, runner_pid=52673, timestamp_ms=1382649492664))
> I1024 21:18:12.737726 52673 helper.py:125] Coordinator stage_twemcache
> [pid: 52758] completed.
> I1024 21:18:12.737891 52673 helper.py:136] Process stage_twemcache [pid:
> 52759] completed.
> I1024 21:18:12.738014 52673 runner.py:903] Transitioning application to LOST
> D1024 21:18:12.738142 52673 helper.py:204]
> TaskRunnerHelper.kill_process(stage_twemcache)
> I1024 21:18:12.738306 52673 helper.py:125] Coordinator stage_twemcache
> [pid: 52758] completed.
> I1024 21:18:12.738466 52673 helper.py:136] Process stage_twemcache [pid:
> 52759] completed.
> D1024 21:18:12.738584 52673 helper.py:212] => SIGKILL coordinator group
> 52758
> I1024 21:18:12.738967 52673 status_manager.py:150] Sending terminal state
> update: TASK_LOST
> {noformat}
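
A minimal sketch of the behaviour requested above, under stated assumptions: a missing user is reported as FAILED rather than LOST, whether it is detected at sandbox setup or later in the runner. The names lookup_user and terminal_state_for are hypothetical and are not taken from the Thermos code; only pwd.getpwnam (Python standard library, raises KeyError for an unknown name) and the UnknownUserError name from the traceback are real.

```python
import pwd


class UnknownUserError(Exception):
    """Raised when the task's user has no passwd entry on this machine."""


def lookup_user(username, getpwnam=pwd.getpwnam):
    """Return the passwd entry for username, or raise UnknownUserError.

    getpwnam is injectable so the lookup can be exercised without
    touching the real passwd database (an assumption of this sketch).
    """
    try:
        return getpwnam(username)
    except KeyError as e:
        # pwd.getpwnam raises KeyError when the name is not found.
        raise UnknownUserError(
            'Unable to get pwent information for %r: %s' % (username, e))


def terminal_state_for(error):
    """Hypothetical mapping from a runner error to a terminal task state.

    A missing user is a deterministic task-level problem, so it maps to
    FAILED; anything unrecognised keeps the existing LOST behaviour.
    """
    if isinstance(error, UnknownUserError):
        return 'FAILED'
    return 'LOST'
```

For example, the executor could catch the error raised by lookup_user('zmanji') and pass it to terminal_state_for to choose the status update it sends, instead of defaulting to LOST when the runner is found dead.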