Henning Schmiedehausen created MESOS-2580:
---------------------------------------------
Summary: mesos slave resends TaskStatus as TASK_RUNNING after
TASK_KILLED
Key: MESOS-2580
URL: https://issues.apache.org/jira/browse/MESOS-2580
Project: Mesos
Issue Type: Bug
Components: slave
Affects Versions: 0.22.0
Reporter: Henning Schmiedehausen
I believe that this uncovered an actual bug in the slave
behavior:
- Executor exits, Slave sends TaskStatus "TASK_KILLED" message to master
- Master sends the message to the Framework.
- Framework code fails to acknowledge the message (which is a
legitimate error case, because messages can get lost)
- After a timeout, the Slave resends the TaskStatus message, however,
it does so with state TASK_RUNNING, not TASK_KILLED.
I could verify that behavior in 0.22.0 and the current trunk
here is my log file from the slave:
https://gist.github.com/hgschmie/fc20b361a2cadeba0fbd
- Slave shuts down executor:
I0330 16:27:59.187947 18428 slave.cpp:1558] Asked to kill task
MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
framework Singularity_local
I0330 16:27:59.239296 18425 slave.cpp:2508] Handling status update
TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
framework Singularity_local from executor(1)@10.101.131.128:55377
I0330 16:27:59.239320 18425 slave.cpp:4468] Terminating task
MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
I0330 16:27:59.240444 18424 status_update_manager.cpp:317] Received
status update TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396)
for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
of framework Singularity_local
I0330 16:27:59.240613 18425 slave.cpp:2680] Status update manager
successfully handled status update TASK_KILLED (UUID:
7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
framework Singularity_local
I0330 16:27:59.240623 18425 slave.cpp:2686] Sending acknowledgement
for status update TASK_KILLED (UUID:
7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
framework Singularity_local to executor(1)@10.101.131.128:55377
I0330 16:28:00.443361 18422 slave.cpp:3193] Executor
'MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT' of
framework Singularity_local exited with status 0
Now my framework receives the TASK_KILLED (I see that) but fails to
send the ack message back to mesos. After 30 seconds:
W0330 16:28:30.097679 18426 status_update_manager.cpp:472] Resending
status update TASK_RUNNING (UUID:
ea431985-98de-4447-8668-fda26c95f040) for task
MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
framework Singularity_local
I0330 16:28:30.097704 18426 status_update_manager.cpp:371] Forwarding
update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for
task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
of framework Singularity_local to the slave
I0330 16:28:30.097761 18426 slave.cpp:2753] Forwarding the update
TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task
MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
framework Singularity_local to [email protected]:5050
After another 20 seconds:
W0330 16:29:50.098641 18425 status_update_manager.cpp:472] Resending
status update TASK_RUNNING (UUID:
ea431985-98de-4447-8668-fda26c95f040) for task
MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
framework Singularity_local
I0330 16:29:50.098659 18425 status_update_manager.cpp:371] Forwarding
update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for
task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
of framework Singularity_local to the slave
I0330 16:29:50.098737 18422 slave.cpp:2753] Forwarding the update
TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task
MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
framework Singularity_local to [email protected]:5050
so for my task "
MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT",
which is finished (the executor has exited), the slave is now
resending "TASK_RUNNING". And that can not be right. This should be
"TASK_KILLED".
As a result, the state machine in my framework is now confused,
because a task just transitioned from TASK_KILLED to TASK_RUNNING.
Which should not happen, because TASK_KILLED is a terminal state.
So I guess that a bug in my framework code has exposed a bug in the
mesos slave state resending code.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)