[
https://issues.apache.org/jira/browse/AURORA-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612999#comment-14612999
]
Stephan Erb commented on AURORA-1335:
-------------------------------------
[~wickman], do you have a hunch where the problem might be located? I will
probably look into this bug, and a hint from you might speed this up
significantly.
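For reference, the escalation behavior requested below (TERM first, SIGKILL only after a grace period) could be sketched roughly like this. This is a hypothetical illustration, not Thermos code: `terminate_then_kill`, the grace period, and the polling loop are all made up for the sketch, and it only works for direct children since it relies on `waitpid`:

{code}
import os
import signal
import time

def terminate_then_kill(pid, grace_period=10.0, poll_interval=0.1):
    # Send SIGTERM first so the child can flush buffers and close
    # resources such as database connections.
    os.kill(pid, signal.SIGTERM)
    deadline = time.time() + grace_period
    while time.time() < deadline:
        # Reap the child without blocking; (0, 0) means still running.
        done_pid, _ = os.waitpid(pid, os.WNOHANG)
        if done_pid == pid:
            return True  # exited gracefully within the grace period
        time.sleep(poll_interval)
    # Grace period expired: only now resort to SIGKILL.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return False
{code}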
> Thermos should not immediately resort to killing processes
> ----------------------------------------------------------
>
> Key: AURORA-1335
> URL: https://issues.apache.org/jira/browse/AURORA-1335
> Project: Aurora
> Issue Type: Bug
> Components: Executor, Thermos
> Reporter: Stephan Erb
>
> As a user of Aurora, I would like my processes to be terminated in a graceful
> manner so that they have time to properly flush their buffers and clean up
> resources such as database connections.
> In its current form, the executor sends a TERM signal, which is immediately
> followed by a KILL signal. As an example, see the timings in the following
> debug log output of a Thermos runner:
> {code}
> D0526 13:20:56.829274 29 ckpt.py:348] Flipping task state from ACTIVE to
> CLEANING
> D0526 13:20:56.829396 29 runner.py:242] _on_task_transition:
> TaskStatus(state=5, runner_uid=0, runner_pid=29, timestamp_ms=1432639256829)
> D0526 13:20:56.829545 29 runner.py:188] Task on_cleaning(TaskStatus(state=5,
> runner_uid=0, runner_pid=29, timestamp_ms=1432639256829))
> TaskRunnerHelper.terminate_process(service)
> D0526 13:20:56.832633 29 helper.py:238] => SIGTERM pid 119
> D0526 13:20:56.832775 29 runner.py:327] TaskRunnerStage[CLEANING]:
> Finalization remaining: 59.9783368111
> D0526 13:20:56.834014 118 process.py:103] [process: 118=service]: child
> state transition
> [/var/run/thermos/checkpoints/1432639067962-service-200-ps-8-3ea6f2a6-535d-4565-ab7c-5fc628f29515/coordinator.service]
> <= RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3,
> process=u'service', start_time=None, coordinator_pid=None, pid=None,
> return_code=-15, state=4, stop_time=1432639256.833447, fork_time=None),
> runner_header=None)
> D0526 13:20:56.834566 118 process.py:103] [process: 118=service]:
> Coordinator exiting.
> D0526 13:20:56.835757 29 runner.py:873] Run loop: Work to be done within 1.0s
> D0526 13:20:56.836005 29 recordio.py:137]
> /var/run/thermos/checkpoints/1432639067962-service-200-ps-8-3ea6f2a6-535d-4565-ab7c-5fc628f29515/coordinator.service
> has no data (current offset = 177)
> D0526 13:20:56.836102 29 muxer.py:155] select() returning 1 updates:
> D0526 13:20:56.836200 29 muxer.py:157] = RunnerCkpt(task_status=None,
> process_status=ProcessStatus(seq=3, process='service', start_time=None,
> coordinator_pid=None, pid=None, return_code=-15, state=4,
> stop_time=1432639256.833447, fork_time=None), runner_header=None)
> D0526 13:20:56.836282 29 ckpt.py:379] Running state machine for
> process=service/seq=3
> D0526 13:20:56.836913 29 runner.py:238] _on_process_transition:
> ProcessStatus(seq=3, process='service', start_time=None,
> coordinator_pid=None, pid=None, return_code=-15, state=4,
> stop_time=1432639256.833447, fork_time=None)
> D0526 13:20:56.837102 29 runner.py:156] Process on_killed
> ProcessStatus(seq=3, process='service', start_time=None,
> coordinator_pid=None, pid=None, return_code=-15, state=4,
> stop_time=1432639256.833447, fork_time=None)
> D0526 13:20:56.837189 29 helper.py:244] TaskRunnerHelper.kill_process(service)
> D0526 13:20:56.837582 29 helper.py:252] => SIGKILL coordinator group 118
> D0526 13:20:56.837745 29 helper.py:255] => SIGKILL coordinator 118
> D0526 13:20:56.838052 29 muxer.py:94] unregistering service
> D0526 13:20:56.838052 29 runner.py:160] Process killed, marking it as a loss.
> D0526 13:20:56.838052 29 runner.py:327] TaskRunnerStage[CLEANING]:
> Finalization remaining: 59.9730448723
> D0526 13:20:56.844118 29 runner.py:873] Run loop: Work to be done within 1.0s
> D0526 13:20:56.894645 64 process.py:103] [process: 64=reverse_proxy]: child
> state transition
> [/var/run/thermos/checkpoints/1432639067962-service-200-ps-8-3ea6f2a6-535d-4565-ab7c-5fc628f29515/coordinator.reverse_proxy]
> <= RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3,
> process=u'reverse_proxy', start_time=None, coordinator_pid=None, pid=None,
> return_code=-15, state=4, stop_time=1432639256.893275, fork_time=None),
> runner_header=None)
> D0526 13:20:56.894645 64 process.py:103] [process: 64=reverse_proxy]:
> Coordinator exiting.
> D0526 13:20:57.849862 29 helper.py:376] Detected terminated process: pid=118,
> status=9, rusage=resource.struct_rusage(ru_utime=0.008, ru_stime=0.024,
> ru_maxrss=19080, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=2448,
> ru_majflt=0, ru_nswap=0, ru_inblock=0, ru_oublock=0, ru_msgsnd=0,
> ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=20, ru_nivcsw=14)
> D0526 13:20:57.850090 29 runner.py:327] TaskRunnerStage[CLEANING]:
> Finalization remaining: 58.9610338211
> D0526 13:20:57.852466 29 runner.py:870] Run loop: No more work to be done in
> state CLEANING
> D0526 13:20:57.852730 29 ckpt.py:348] Flipping task state from CLEANING to
> FINALIZING
> {code}
> Expected behavior would be that Thermos only resorts to killing when the
> application does not honor the termination request.
> Using the HTTP endpoints `/quitquitquit` and `/abortabortabort` is not an
> option due to the inherent security problems of unauthenticated requests.
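On the application side, honoring the termination request described above could look like the following sketch. `install_graceful_shutdown` is a hypothetical helper invented for illustration, not part of Aurora or Thermos:

{code}
import signal
import sys

def install_graceful_shutdown(cleanup):
    # Invoke cleanup (flush buffers, close database connections, ...)
    # when SIGTERM arrives, then exit with the conventional
    # 128 + signal-number status.
    def handler(signum, frame):
        cleanup()
        sys.exit(128 + signum)
    signal.signal(signal.SIGTERM, handler)
{code}

With such a handler installed, the process exits cleanly on the initial SIGTERM, so the executor would never need to escalate to SIGKILL.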
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)