[
https://issues.apache.org/jira/browse/AURORA-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612999#comment-14612999
]
Stephan Erb commented on AURORA-1335:
-------------------------------------
[~wickman], do you have a hunch where the problem might be located? I will
probably look into this bug, and a hint from you might speed this up
significantly.
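For reference, the escalation behavior requested below (TERM first, SIGKILL only after a grace period) could be sketched roughly like this. This is a hypothetical illustration, not Thermos code: `terminate_then_kill`, the grace period, and the polling loop are all made up for the sketch, and it only works for direct children since it relies on `waitpid`:

{code}
import os
import signal
import time

def terminate_then_kill(pid, grace_period=10.0, poll_interval=0.1):
    # Send SIGTERM first so the child can flush buffers and close
    # resources such as database connections.
    os.kill(pid, signal.SIGTERM)
    deadline = time.time() + grace_period
    while time.time() < deadline:
        # Reap the child without blocking; (0, 0) means still running.
        done_pid, _ = os.waitpid(pid, os.WNOHANG)
        if done_pid == pid:
            return True  # exited gracefully within the grace period
        time.sleep(poll_interval)
    # Grace period expired: only now resort to SIGKILL.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return False
{code}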
> Thermos should not immediately resort to killing processes
> ----------------------------------------------------------
>
> Key: AURORA-1335
> URL: https://issues.apache.org/jira/browse/AURORA-1335
> Project: Aurora
> Issue Type: Bug
> Components: Executor, Thermos
> Reporter: Stephan Erb
>
> As a user of Aurora, I would like my processes to be terminated in a graceful
> manner so that they have time to properly flush their buffers and clean up
> resources such as database connections.
> In its current form, the executor sends a TERM signal, which is immediately
> followed by a KILL signal. As an example, see the timings in the following
> debug log output of a Thermos runner:
> {code}
> D0526 13:20:56.829274 29 ckpt.py:348] Flipping task state from ACTIVE to
> CLEANING
> D0526 13:20:56.829396 29 runner.py:242] _on_task_transition:
> TaskStatus(state=5, runner_uid=0, runner_pid=29, timestamp_ms=1432639256829)
> D0526 13:20:56.829545 29 runner.py:188] Task on_cleaning(TaskStatus(state=5,
> runner_uid=0, runner_pid=29, timestamp_ms=1432639256829))
> TaskRunnerHelper.terminate_process(service)
> D0526 13:20:56.832633 29 helper.py:238] => SIGTERM pid 119
> D0526 13:20:56.832775 29 runner.py:327] TaskRunnerStage[CLEANING]:
> Finalization remaining: 59.9783368111
> D0526 13:20:56.834014 118 process.py:103] [process: 118=service]: child
> state transition
> [/var/run/thermos/checkpoints/1432639067962-service-200-ps-8-3ea6f2a6-535d-4565-ab7c-5fc628f29515/coordinator.service]
> <= RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3,
> process=u'service', start_time=None, coordinator_pid=None, pid=None,
> return_code=-15, state=4, stop_time=1432639256.833447, fork_time=None),
> runner_header=None)
> D0526 13:20:56.834566 118 process.py:103] [process: 118=service]:
> Coordinator exiting.
> D0526 13:20:56.835757 29 runner.py:873] Run loop: Work to be done within 1.0s
> D0526 13:20:56.836005 29 recordio.py:137]
> /var/run/thermos/checkpoints/1432639067962-service-200-ps-8-3ea6f2a6-535d-4565-ab7c-5fc628f29515/coordinator.service
> has no data (current offset = 177)
> D0526 13:20:56.836102 29 muxer.py:155] select() returning 1 updates:
> D0526 13:20:56.836200 29 muxer.py:157] = RunnerCkpt(task_status=None,
> process_status=ProcessStatus(seq=3, process='service', start_time=None,
> coordinator_pid=None, pid=None, return_code=-15, state=4,
> stop_time=1432639256.833447, fork_time=None), runner_header=None)
> D0526 13:20:56.836282 29 ckpt.py:379] Running state machine for
> process=service/seq=3
> D0526 13:20:56.836913 29 runner.py:238] _on_process_transition:
> ProcessStatus(seq=3, process='service', start_time=None,
> coordinator_pid=None, pid=None, return_code=-15, state=4,
> stop_time=1432639256.833447, fork_time=None)
> D0526 13:20:56.837102 29 runner.py:156] Process on_killed
> ProcessStatus(seq=3, process='service', start_time=None,
> coordinator_pid=None, pid=None, return_code=-15, state=4,
> stop_time=1432639256.833447, fork_time=None)
> D0526 13:20:56.837189 29 helper.py:244] TaskRunnerHelper.kill_process(service)
> D0526 13:20:56.837582 29 helper.py:252] => SIGKILL coordinator group 118
> D0526 13:20:56.837745 29 helper.py:255] => SIGKILL coordinator 118
> D0526 13:20:56.838052 29 muxer.py:94] unregistering service
> D0526 13:20:56.838052 29 runner.py:160] Process killed, marking it as a loss.
> D0526 13:20:56.838052 29 runner.py:327] TaskRunnerStage[CLEANING]:
> Finalization remaining: 59.9730448723
> D0526 13:20:56.844118 29 runner.py:873] Run loop: Work to be done within 1.0s
> D0526 13:20:56.894645 64 process.py:103] [process: 64=reverse_proxy]: child
> state transition
> [/var/run/thermos/checkpoints/1432639067962-service-200-ps-8-3ea6f2a6-535d-4565-ab7c-5fc628f29515/coordinator.reverse_proxy]
> <= RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3,
> process=u'reverse_proxy', start_time=None, coordinator_pid=None, pid=None,
> return_code=-15, state=4, stop_time=1432639256.893275, fork_time=None),
> runner_header=None)
> D0526 13:20:56.894645 64 process.py:103] [process: 64=reverse_proxy]:
> Coordinator exiting.
> D0526 13:20:57.849862 29 helper.py:376] Detected terminated process: pid=118,
> status=9, rusage=resource.struct_rusage(ru_utime=0.008, ru_stime=0.024,
> ru_maxrss=19080, ru_ixrss=0, ru_idrss=0, ru_isrss=0, ru_minflt=2448,
> ru_majflt=0, ru_nswap=0, ru_inblock=0, ru_oublock=0, ru_msgsnd=0,
> ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=20, ru_nivcsw=14)
> D0526 13:20:57.850090 29 runner.py:327] TaskRunnerStage[CLEANING]:
> Finalization remaining: 58.9610338211
> D0526 13:20:57.852466 29 runner.py:870] Run loop: No more work to be done in
> state CLEANING
> D0526 13:20:57.852730 29 ckpt.py:348] Flipping task state from CLEANING to
> FINALIZING
> {code}
> Expected behavior would be that Thermos only resorts to killing when the
> application does not honor the termination request.
> Using the HTTP endpoints `/quitquitquit` and `/abortabortabort` is not an
> option due to the inherent security problems of unauthenticated requests.
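On the application side, honoring the termination request described above could look like the following sketch. `install_graceful_shutdown` is a hypothetical helper invented for illustration, not part of Aurora or Thermos:

{code}
import signal
import sys

def install_graceful_shutdown(cleanup):
    # Invoke cleanup (flush buffers, close database connections, ...)
    # when SIGTERM arrives, then exit with the conventional
    # 128 + signal-number status.
    def handler(signum, frame):
        cleanup()
        sys.exit(128 + signum)
    signal.signal(signal.SIGTERM, handler)
{code}

With such a handler installed, the process exits cleanly on the initial SIGTERM, so the executor would never need to escalate to SIGKILL.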
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)