brian wickman created AURORA-176:
------------------------------------
Summary: more gracefully handle cases where user does not exist on
machine
Key: AURORA-176
URL: https://issues.apache.org/jira/browse/AURORA-176
Project: Aurora
Issue Type: Task
Components: Thermos
Reporter: brian wickman
Priority: Minor
Tasks are going LOST for this reason:
{noformat}
Initialization of task runner failed: Could not construct sandbox: 'getpwnam():
name not found: zmanji'
{noformat}
I think everything is actually behaving correctly but this should probably be
propagated as a task FAILED rather than LOST.
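One way to read the request above, as a minimal sketch rather than the actual Thermos code: look the user up before the sandbox is built and turn a missing passwd entry into an explicit FAILED result with a reason. The names `InitResult` and `check_task_user` are hypothetical. The only library behaviour assumed is that `pwd.getpwnam` raises `KeyError` for an unknown name.

```python
import pwd
from collections import namedtuple

# Hypothetical container for a terminal status and a human-readable reason.
InitResult = namedtuple("InitResult", ["state", "reason"])


def check_task_user(username, lookup=pwd.getpwnam):
    """Return a FAILED InitResult if `username` is unknown, else None.

    `lookup` is injectable so the check can be exercised without
    depending on the accounts present on the local machine.
    """
    try:
        lookup(username)
    except KeyError as e:
        # pwd.getpwnam raises KeyError when the passwd entry is missing.
        return InitResult("FAILED", "Could not construct sandbox: %s" % e)
    return None


if __name__ == "__main__":
    result = check_task_user("no_such_user_aurora_176")
    if result is not None:
        print("%s : %s" % (result.state, result.reason))
```

The point of returning a result instead of raising is that the caller can report a terminal state itself, rather than dying and leaving the executor to infer LOST.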
From Jon B:
As far as the specific case described here (failure during task runner
initialization), this is actually done - it now results in FAILED:
{noformat}
2 mins ago - FAILED : Initialization of task runner failed: Could not construct
sandbox: 'getpwnam(): name not found: cosmingheorghe'
{noformat}
However, a task can still go LOST if its user stops existing after the sandbox
is set up and the task is already running, because the runner then terminates
abnormally:
{noformat}
D1024 21:17:11.360357 52726 runner.py:748] runnable: application
D1024 21:17:11.360486 52726 runner.py:750] waiting:
D1024 21:17:11.361073 52726 runner.py:647] _set_process_status(application <=
WAITING, seq=4[auto])
D1024 21:17:11.361279 52726 ckpt.py:367] Running state machine for
process=application/seq=4
D1024 21:17:11.361418 52726 runner.py:225] _on_process_transition:
ProcessStatus(seq=4, process=u'application', start_time=None,
coordinator_pid=None, pid=None, return_code=None, state=0, stop_time=None,
fork_time=None)
D1024 21:17:11.428257 52726 runner.py:90] Process on_waiting
ProcessStatus(seq=4, process=u'application', start_time=None,
coordinator_pid=None, pid=None, return_code=None, state=0, stop_time=None,
fork_time=None)
E1024 21:17:11.429981 52726 runner.py:545] Caught exception in self.control():
Unable to get pwent information!
E1024 21:17:11.432827 52726 runner.py:546] Traceback (most recent call last):
File "twitter/thermos/runner/runner.py", line 543, in control
yield
File "twitter/thermos/runner/runner.py", line 831, in run
self._run()
File "twitter/thermos/runner/runner.py", line 839, in _run
iteration_wait = runner.run()
File "twitter/thermos/runner/runner.py", line 278, in run
launched = self.runner._run_plan(self.runner._regular_plan)
File "twitter/thermos/runner/runner.py", line 764, in _run_plan
self._set_process_status(process_name, ProcessState.WAITING)
File "twitter/thermos/runner/runner.py", line 650, in _set_process_status
self._dispatcher.dispatch(self._state, runner_ckpt, self._recovery)
File "twitter/thermos/base/ckpt.py", line 372, in dispatch
self._run_process_dispatch(process_update.state, process_update)
File "twitter/thermos/base/ckpt.py", line 200, in _run_process_dispatch
getattr(handler, handler_function)(process_update)
File "twitter/thermos/runner/runner.py", line 93, in on_waiting
process_update.process, process_update.seq + 1))
File "twitter/thermos/runner/runner.py", line 675, in
_task_process_from_process_name
fork=close_ckpt_and_fork)
File "twitter/thermos/runner/process.py", line 287, in __init__
ProcessBase.__init__(self, *args, **kw)
File "twitter/thermos/runner/process.py", line 93, in __init__
user, current_user = self._getpwuid() # may raise self.UnknownUserError
File "twitter/thermos/runner/process.py", line 214, in _getpwuid
raise self.UnknownUserError('Unable to get pwent information!')
UnknownUserError: Unable to get pwent information!
{noformat}
and then the executor marks it as LOST:
{noformat}
I1024 21:17:11.947812 52673 status_manager.py:69] Executor polling thread
detected termination condition.
I1024 21:17:11.948065 52673 task_runner_wrapper.py:184] Runner is dead,
skipping kill.
I1024 21:18:11.114897 52673 status_manager.py:109] Waiting for terminal state,
current state: ACTIVE
...
I1024 21:18:11.616302 52673 status_manager.py:109] Waiting for terminal state,
current state: ACTIVE
I1024 21:18:12.117630 52673 status_manager.py:125] State we've accepted:
Thermos(ACTIVE) / Failure: None
E1024 21:18:12.117769 52673 status_manager.py:129] Runner is dead but task
state unexpectedly ACTIVE!
...
D1024 21:18:12.664916 52673 ckpt.py:336] Flipping task state from FINALIZING to
KILLED
D1024 21:18:12.665034 52673 runner.py:229] _on_task_transition:
TaskStatus(state=3, runner_uid=0, runner_pid=52673, timestamp_ms=1382649492664)
D1024 21:18:12.737503 52673 runner.py:188] Task on_killed(TaskStatus(state=3,
runner_uid=0, runner_pid=52673, timestamp_ms=1382649492664))
I1024 21:18:12.737726 52673 helper.py:125] Coordinator stage_twemcache [pid:
52758] completed.
I1024 21:18:12.737891 52673 helper.py:136] Process stage_twemcache [pid:
52759] completed.
I1024 21:18:12.738014 52673 runner.py:903] Transitioning application to LOST
D1024 21:18:12.738142 52673 helper.py:204]
TaskRunnerHelper.kill_process(stage_twemcache)
I1024 21:18:12.738306 52673 helper.py:125] Coordinator stage_twemcache [pid:
52758] completed.
I1024 21:18:12.738466 52673 helper.py:136] Process stage_twemcache [pid:
52759] completed.
D1024 21:18:12.738584 52673 helper.py:212] => SIGKILL coordinator group 52758
I1024 21:18:12.738967 52673 status_manager.py:150] Sending terminal state
update: TASK_LOST
{noformat}
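For the remaining case in the logs above, a hedged sketch of one possible fix: catch the unknown-user error at the point where a process object is constructed, and record a FAILED terminal state instead of letting the exception kill the runner. Everything here is illustrative. `SketchRunner`, `launch`, and the local `UnknownUserError` are stand-ins, not the real classes in `twitter/thermos/runner`.

```python
class UnknownUserError(Exception):
    """Stand-in for the UnknownUserError seen in the traceback."""


ACTIVE = "ACTIVE"
FAILED = "FAILED"


class SketchRunner(object):
    """Hypothetical runner that survives a missing user at launch time."""

    def __init__(self, make_process):
        # make_process(name) builds a process object and may raise
        # UnknownUserError, as Process.__init__ does in the traceback.
        self._make_process = make_process
        self.state = ACTIVE
        self.failure_reason = None

    def launch(self, process_name):
        try:
            return self._make_process(process_name)
        except UnknownUserError as e:
            # Record an explicit terminal state so the executor does not
            # find a dead runner with an ACTIVE task and fall back to LOST.
            self.state = FAILED
            self.failure_reason = "Could not launch %s: %s" % (process_name, e)
            return None


if __name__ == "__main__":
    def broken_factory(name):
        raise UnknownUserError("Unable to get pwent information!")

    runner = SketchRunner(broken_factory)
    runner.launch("application")
    print("%s : %s" % (runner.state, runner.failure_reason))
```

Whether the state should be written through the checkpoint stream or reported directly by the executor is left open here. The sketch only shows where the exception could be intercepted.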