Maxim Khutornenko created AURORA-1642:
-----------------------------------------

             Summary: Thermos runner finalization is broken
                 Key: AURORA-1642
                 URL: https://issues.apache.org/jira/browse/AURORA-1642
             Project: Aurora
          Issue Type: Bug
          Components: Executor
            Reporter: Maxim Khutornenko


We have noticed thermos runner finalization no longer works after this commit 
[024bac9dcb8f37e4b31210e3a0a7aea2345a16ab|https://reviews.apache.org/r/40922/] 
for tasks with blocking threads. 

I was able to reproduce it in Vagrant by extending the sleep timeout of the 
{{hello}} task and running {{aurora job killall}} immediately after launching 
it:
{noformat}
while true; do
      echo hello world
      sleep 600
{noformat}
The finalizer never has a chance to run and after 1 minute a task is forcefully 
aborted:
{noformat}
D0316 04:00:35.237905 19362 runner.py:951] Runner issued kill: force:False, 
preemption_wait:1 mins
D0316 04:00:35.238183 19362 runner.py:567] Flipping recovery mode off.
D0316 04:00:35.238308 19362 ckpt.py:348] Flipping task state from ACTIVE to 
ACTIVE
D0316 04:00:35.238437 19362 runner.py:242] _on_task_transition: 
TaskStatus(state=0, runner_uid=0, runner_pid=19362, timestamp_ms=1458100835238)
D0316 04:00:35.239079 19362 runner.py:180] Task on_active(TaskStatus(state=0, 
runner_uid=0, runner_pid=19362, timestamp_ms=1458100835238))
D0316 04:00:35.241660 19362 ckpt.py:348] Flipping task state from ACTIVE to 
CLEANING
D0316 04:00:35.241765 19362 runner.py:242] _on_task_transition: 
TaskStatus(state=5, runner_uid=0, runner_pid=19362, timestamp_ms=1458100835241)
D0316 04:00:35.249836 19362 runner.py:188] Task on_cleaning(TaskStatus(state=5, 
runner_uid=0, runner_pid=19362, timestamp_ms=1458100835241))
D0316 04:00:35.249953 19362 helper.py:217] 
TaskRunnerHelper.terminate_process(hello)
D0316 04:00:35.256520 19362 helper.py:220]    => SIGTERM pid 19368
D0316 04:00:35.256705 19362 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 59.9812531471
D0316 04:00:35.262578 19362 runner.py:929] Run loop: Work to be done within 1.0s
D0316 04:00:36.263881 19362 runner.py:939] Run loop: No updates collected, 
touching checkpoint.
D0316 04:00:36.264199 19362 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 58.9737620354
D0316 04:00:36.264734 19362 runner.py:929] Run loop: Work to be done within 1.0s
---<skipped>---
D0316 04:01:31.397888 19362 runner.py:939] Run loop: No updates collected, 
touching checkpoint.
D0316 04:01:31.398144 19362 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 3.83981513977
D0316 04:01:31.398538 19362 runner.py:929] Run loop: Work to be done within 1.0s
D0316 04:01:32.400230 19362 runner.py:939] Run loop: No updates collected, 
touching checkpoint.
D0316 04:01:32.401125 19362 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 2.8368370533
D0316 04:01:32.401596 19362 runner.py:929] Run loop: Work to be done within 1.0s
D0316 04:01:33.404506 19362 runner.py:939] Run loop: No updates collected, 
touching checkpoint.
D0316 04:01:33.404815 19362 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 1.83315014839
D0316 04:01:33.405534 19362 runner.py:929] Run loop: Work to be done within 1.0s
D0316 04:01:34.406909 19362 runner.py:939] Run loop: No updates collected, 
touching checkpoint.
D0316 04:01:34.407223 19362 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 0.830743074417
D0316 04:01:34.407908 19362 runner.py:929] Run loop: Work to be done within 0.8s
D0316 04:01:35.415529 19362 runner.py:939] Run loop: No updates collected, 
touching checkpoint.
D0316 04:01:35.415683 19362 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 0
D0316 04:01:35.415740 19362 runner.py:926] Run loop: No more work to be done in 
state CLEANING
D0316 04:01:35.415888 19362 runner.py:903] Forced terminal state: KILLED
D0316 04:01:35.415936 19362 ckpt.py:348] Flipping task state from CLEANING to 
KILLED
D0316 04:01:35.415980 19362 runner.py:242] _on_task_transition: 
TaskStatus(state=3, runner_uid=0, runner_pid=19362, timestamp_ms=1458100895415)
D0316 04:01:35.416937 19362 runner.py:201] Task on_killed(TaskStatus(state=3, 
runner_uid=0, runner_pid=19362, timestamp_ms=1458100895415))
D0316 04:01:35.417393 19362 runner.py:684] _set_process_status(hello <= KILLED, 
seq=3[auto])
D0316 04:01:35.417458 19362 ckpt.py:379] Running state machine for 
process=hello/seq=3
D0316 04:01:35.417460 19362 runner.py:238] _on_process_transition: 
ProcessStatus(seq=3, process=u'hello', start_time=None, coordinator_pid=None, 
pid=None, return_code=-1, state=4, stop_time=1458100895.417381, fork_time=None)
D0316 04:01:35.417853 19362 runner.py:156] Process on_killed 
ProcessStatus(seq=3, process=u'hello', start_time=None, coordinator_pid=None, 
pid=None, return_code=-1, state=4, stop_time=1458100895.417381, fork_time=None)
D0316 04:01:35.417921 19362 helper.py:226] TaskRunnerHelper.kill_process(hello)
D0316 04:01:35.418145 19362 helper.py:234]    => SIGKILL coordinator group 19367
D0316 04:01:35.418934 19362 helper.py:237]    => SIGKILL coordinator 19367
D0316 04:01:35.419006 19362 muxer.py:94] unregistering hello
D0316 04:01:35.419069 19362 runner.py:160] Process killed, marking it as a loss.
{noformat}


After reverting the above commit, the task is killed almost instantaneously 
with the finalizer exiting properly:
{noformat}
D0316 04:03:42.339101 20053 runner.py:939] Runner issued kill: force:False, 
preemption_wait:1 mins
D0316 04:03:42.339370 20053 runner.py:564] Flipping recovery mode off.
D0316 04:03:42.339503 20053 ckpt.py:348] Flipping task state from ACTIVE to 
ACTIVE
D0316 04:03:42.339571 20053 runner.py:242] _on_task_transition: 
TaskStatus(state=0, runner_uid=0, runner_pid=20053, timestamp_ms=1458101022339)
D0316 04:03:42.340327 20053 runner.py:180] Task on_active(TaskStatus(state=0, 
runner_uid=0, runner_pid=20053, timestamp_ms=1458101022339))
D0316 04:03:42.343360 20053 ckpt.py:348] Flipping task state from ACTIVE to 
CLEANING
D0316 04:03:42.343463 20053 runner.py:242] _on_task_transition: 
TaskStatus(state=5, runner_uid=0, runner_pid=20053, timestamp_ms=1458101022343)
D0316 04:03:42.344290 20053 runner.py:188] Task on_cleaning(TaskStatus(state=5, 
runner_uid=0, runner_pid=20053, timestamp_ms=1458101022343))
D0316 04:03:42.344388 20053 helper.py:217] 
TaskRunnerHelper.terminate_process(hello)
D0316 04:03:42.349311 20053 helper.py:220]    => SIGTERM pid 20059
D0316 04:03:42.349566 20058 process.py:132] [process:20058=hello]: child state 
transition 
[/var/lib/mesos/slaves/cf74b250-4134-4227-8f86-b48a67e15b89-S0/frameworks/cf74b250-4134-4227-8f86-b48a67e15b89-0000/executors/thermos-www-data-prod-hello-0-acb78a97-a551-4e16-ba7e-2c488717556d/runs/0310abe1-3a24-4308-9c5e-6df7714ca2f2/checkpoints/checkpoints/www-data-prod-hello-0-acb78a97-a551-4e16-ba7e-2c488717556d/coordinator.hello]
 <= RunnerCkpt(task_status=None, process_status=ProcessStatus(seq=3, 
process=u'hello', start_time=None, coordinator_pid=None, pid=None, 
return_code=-15, state=4, stop_time=1458101022.34949, fork_time=None), 
runner_header=None)
D0316 04:03:42.349777 20053 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 59.9893760681
D0316 04:03:42.350320 20058 process.py:132] [process:20058=hello]: Coordinator 
exiting.
D0316 04:03:42.360085 20053 runner.py:917] Run loop: Work to be done within 1.0s
D0316 04:03:42.360292 20053 recordio.py:137] 
/var/lib/mesos/slaves/cf74b250-4134-4227-8f86-b48a67e15b89-S0/frameworks/cf74b250-4134-4227-8f86-b48a67e15b89-0000/executors/thermos-www-data-prod-hello-0-acb78a97-a551-4e16-ba7e-2c488717556d/runs/0310abe1-3a24-4308-9c5e-6df7714ca2f2/checkpoints/checkpoints/www-data-prod-hello-0-acb78a97-a551-4e16-ba7e-2c488717556d/coordinator.hello
 has no data (current offset = 171)
D0316 04:03:42.360383 20053 muxer.py:155] select() returning 1 updates:
D0316 04:03:42.360440 20053 muxer.py:157]   = RunnerCkpt(task_status=None, 
process_status=ProcessStatus(seq=3, process='hello', start_time=None, 
coordinator_pid=None, pid=None, return_code=-15, state=4, 
stop_time=1458101022.34949, fork_time=None), runner_header=None)
D0316 04:03:42.360486 20053 ckpt.py:379] Running state machine for 
process=hello/seq=3
D0316 04:03:42.360543 20053 runner.py:238] _on_process_transition: 
ProcessStatus(seq=3, process='hello', start_time=None, coordinator_pid=None, 
pid=None, return_code=-15, state=4, stop_time=1458101022.34949, fork_time=None)
D0316 04:03:42.361098 20053 runner.py:156] Process on_killed 
ProcessStatus(seq=3, process='hello', start_time=None, coordinator_pid=None, 
pid=None, return_code=-15, state=4, stop_time=1458101022.34949, fork_time=None)
D0316 04:03:42.361164 20053 helper.py:226] TaskRunnerHelper.kill_process(hello)
D0316 04:03:42.361391 20053 helper.py:234]    => SIGKILL coordinator group 20058
D0316 04:03:42.361438 20053 helper.py:237]    => SIGKILL coordinator 20058
D0316 04:03:42.361488 20053 muxer.py:94] unregistering hello
D0316 04:03:42.361536 20053 runner.py:160] Process killed, marking it as a loss.
D0316 04:03:42.361555 20053 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 59.9775869846
D0316 04:03:42.361555 20053 runner.py:917] Run loop: Work to be done within 1.0s
D0316 04:03:42.361555 20053 runner.py:925] Update collection only took 0.0s, 
idling 1.0s
D0316 04:03:43.361848 20053 runner.py:927] Run loop: No updates collected, 
touching checkpoint.
D0316 04:03:43.362040 20053 helper.py:358] Detected terminated process: 
pid=20058, status=9, rusage=resource.struct_rusage(ru_utime=0.00684, 
ru_stime=0.008775999999999999, ru_maxrss=16288, ru_ixrss=0, ru_idrss=0, 
ru_isrss=0, ru_minflt=4105, ru_majflt=0, ru_nswap=0, ru_inblock=0, 
ru_oublock=48, ru_msgsnd=0, ru_msgrcv=0, ru_nsignals=0, ru_nvcsw=9, 
ru_nivcsw=14)
D0316 04:03:43.362119 20053 runner.py:327] TaskRunnerStage[CLEANING]: 
Finalization remaining: 58.9770309925
D0316 04:03:43.362361 20053 runner.py:914] Run loop: No more work to be done in 
state CLEANING
D0316 04:03:43.362416 20053 ckpt.py:348] Flipping task state from CLEANING to 
FINALIZING
D0316 04:03:43.362462 20053 runner.py:242] _on_task_transition: 
TaskStatus(state=6, runner_uid=0, runner_pid=20053, timestamp_ms=1458101023362)
D0316 04:03:43.363035 20053 runner.py:193] Task 
on_finalizing(TaskStatus(state=6, runner_uid=0, runner_pid=20053, 
timestamp_ms=1458101023362))
D0316 04:03:43.363317 20053 helper.py:226] TaskRunnerHelper.kill_process(hello)
D0316 04:03:43.363497 20053 helper.py:234]    => SIGKILL coordinator group 20058
D0316 04:03:43.363550 20053 runner.py:801] Schedule pass:
D0316 04:03:43.363619 20053 runner.py:804] running: 
D0316 04:03:43.363662 20053 runner.py:805] finished: 
D0316 04:03:43.363723 20053 runner.py:815] runnable: 
D0316 04:03:43.363761 20053 runner.py:817] waiting: 
D0316 04:03:43.363961 20053 runner.py:348] TaskRunnerStage[FINALIZING]: 
Finalization remaining: 58.9751880169
D0316 04:03:43.364037 20053 runner.py:753] running:0 runnable:0 waiting:0 
complete:True
D0316 04:03:43.364084 20053 runner.py:914] Run loop: No more work to be done in 
state FINALIZING
D0316 04:03:43.364084 20053 runner.py:891] Forced terminal state: KILLED
D0316 04:03:43.364084 20053 ckpt.py:348] Flipping task state from FINALIZING to 
KILLED
D0316 04:03:43.364084 20053 runner.py:242] _on_task_transition: 
TaskStatus(state=3, runner_uid=0, runner_pid=20053, timestamp_ms=1458101023364)
D0316 04:03:43.364553 20053 runner.py:201] Task on_killed(TaskStatus(state=3, 
runner_uid=0, runner_pid=20053, timestamp_ms=1458101023364))
D0316 04:03:43.364784 20053 helper.py:226] TaskRunnerHelper.kill_process(hello)
D0316 04:03:43.364965 20053 helper.py:234]    => SIGKILL coordinator group 
20058 
{noformat}

Root cause still TBD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to