https://issues.apache.org/jira/browse/MESOS-2580
On Tue, Mar 31, 2015 at 11:20 AM, Niklas Nielsen <[email protected]> wrote:
> Hi Henning,
>
> It does sound suspicious. Would you mind capturing this in a JIRA ticket (targeted 0.22.1) so we can track progress and assign an owner?
>
> Thanks,
> Niklas
>
> On 31 March 2015 at 10:00, Henning Schmiedehausen <[email protected]> wrote:
>>
>> Ok, so here is my log file from the slave:
>> https://gist.github.com/hgschmie/fc20b361a2cadeba0fbd
>>
>> - Slave shuts down executor:
>>
>> I0330 16:27:59.187947 18428 slave.cpp:1558] Asked to kill task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local
>> I0330 16:27:59.239296 18425 slave.cpp:2508] Handling status update TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local from executor(1)@10.101.131.128:55377
>> I0330 16:27:59.239320 18425 slave.cpp:4468] Terminating task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
>> I0330 16:27:59.240444 18424 status_update_manager.cpp:317] Received status update TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local
>> I0330 16:27:59.240613 18425 slave.cpp:2680] Status update manager successfully handled status update TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local
>> I0330 16:27:59.240623 18425 slave.cpp:2686] Sending acknowledgement for status update TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local to executor(1)@10.101.131.128:55377
>> I0330 16:28:00.443361 18422 slave.cpp:3193] Executor 'MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT' of framework Singularity_local exited with status 0
>>
>> Now my framework receives the TASK_KILLED (I see that) but fails to send the ack message back to mesos.
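For reference, with explicit acknowledgements (available from the 0.22.0 scheduler driver) the ack that goes missing here would normally be sent from the framework's statusUpdate callback. A minimal sketch, assuming the Mesos Java bindings and a driver constructed with implicit acknowledgements disabled; the class name is illustrative:

    import org.apache.mesos.Protos.TaskStatus;
    import org.apache.mesos.Scheduler;
    import org.apache.mesos.SchedulerDriver;

    // Sketch of the status-update path only; the remaining Scheduler callbacks
    // are omitted, hence the class is abstract.
    public abstract class AckingScheduler implements Scheduler {
        @Override
        public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
            // ... update framework-internal task state first ...

            // With implicit acknowledgements disabled, the update only counts
            // as delivered once this call is made. If it is skipped or the
            // message is lost, the slave's status update manager resends the
            // update after a timeout, which is the retry seen in the logs.
            driver.acknowledgeStatusUpdate(status);
        }
    }

Whether the ack is sent implicitly by the driver or explicitly as above depends on how the driver was constructed; jesos re-implements this driver path in pure Java, which is where the interpretation question further down comes in.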
>> After 30 seconds:
>>
>> W0330 16:28:30.097679 18426 status_update_manager.cpp:472] Resending status update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local
>> I0330 16:28:30.097704 18426 status_update_manager.cpp:371] Forwarding update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local to the slave
>> I0330 16:28:30.097761 18426 slave.cpp:2753] Forwarding the update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local to [email protected]:5050
>>
>> After another 20 seconds:
>>
>> W0330 16:29:50.098641 18425 status_update_manager.cpp:472] Resending status update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local
>> I0330 16:29:50.098659 18425 status_update_manager.cpp:371] Forwarding update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local to the slave
>> I0330 16:29:50.098737 18422 slave.cpp:2753] Forwarding the update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local to [email protected]:5050
>>
>> So for my task "MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT", which is finished (the executor has exited), the slave is now resending "TASK_RUNNING". That cannot be right; it should be "TASK_KILLED".
>>
>> As a result, the state machine in my framework is now confused, because a task just transitioned from TASK_KILLED to TASK_RUNNING, which should not happen: TASK_KILLED is a terminal state.
>>
>> So I guess a bug in my framework code has exposed a bug in the Mesos slave's status update resending code.
>>
>> -h
>>
>> (BTW, I am a bit concerned that these discussions do not happen on the mesos-dev list. I added it in Cc. This is still an Apache project, right? (Wears his Apache hat)).
>>
>> On Mon, Mar 30, 2015 at 9:13 PM, Henning Schmiedehausen <[email protected]> wrote:
>> > Hi Ben,
>> >
>> > Quick question about a comment in the Mesos source code:
>> >
>> > https://github.com/apache/mesos/blob/master/src/sched/sched.cpp#L695-L703
>> >
>> > // If the update is driver-generated or master-generated, it
>> > // does not require acknowledgement and so we unset the 'uuid'
>> > // field of TaskStatus. Otherwise, we overwrite the field to
>> > // ensure that a 0.22.0 scheduler driver supports explicit
>> > // acknowledgements, even if running against a 0.21.0 cluster.
>> > //
>> > // TODO(bmahler): Update the slave / executor driver to ensure
>> > // that 'uuid' is set accurately by the time it reaches the
>> > // scheduler driver. This will be required for pure bindings.
>> > if (from == UPID() || pid == UPID()) {
>> >
>> > No matter how much I twist and turn this piece of code, I interpret it as "if the sender of the message or the pid from the message itself is the local address". Why would that skip master-generated updates?
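One plausible framework-side reading of that comment (a sketch only, not taken from jesos or the Mesos sources; it assumes the Java bindings, where driver-generated and master-generated updates arrive with the 'uuid' field unset) is to acknowledge only updates that actually carry a UUID:

    import org.apache.mesos.Protos.TaskStatus;
    import org.apache.mesos.SchedulerDriver;

    final class StatusUpdateAcks {
        private StatusUpdateAcks() {}

        // Acknowledge only updates that originated from an executor/slave and
        // therefore carry a UUID. Per the sched.cpp comment quoted above,
        // driver- and master-generated updates have 'uuid' unset and do not
        // require an acknowledgement.
        static void maybeAcknowledge(SchedulerDriver driver, TaskStatus status) {
            if (status.hasUuid() && !status.getUuid().isEmpty()) {
                driver.acknowledgeStatusUpdate(status);
            }
        }
    }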
>> >
>> > In fact, my problems seem to be related to a misinterpretation of this statement in my Java driver (jesos, https://github.com/groupon/jesos). However, I believe this uncovered an actual bug in the slave behavior:
>> >
>> > - Executor exits, Slave sends TaskStatus "TASK_KILLED" message to master
>> > - Master sends the message to the Framework.
>> > - Framework code fails to acknowledge the message (which is a legitimate error case, because messages can get lost)
>> > - After a timeout, the Slave resends the TaskStatus message; however, it does so with state TASK_RUNNING, not TASK_KILLED.
>> >
>> > I could verify that behavior in 0.22.0 and the current trunk. I have a bit more debugging in a log file in the office; I will send that to you tomorrow.
>> >
>> > -h
>> >
>> > On Mon, Mar 30, 2015 at 7:45 PM, Benjamin Hindman <[email protected]> wrote:
>> >> Hey Henning!
>> >>
>> >> Someone mentioned today that you were discussing a potential bug in 0.22.0 on IRC? Can you elaborate? We are taking this very seriously and would like to make sure any issues are dealt with promptly.
>> >>
>> >> Ben.
>> >
>> > --
>> > Henning Schmiedehausen - Groupon Engineering Principal
>> > Groupon Inc, Palo Alto - [email protected]
>>
>> --
>> Henning Schmiedehausen - Groupon Engineering Principal
>> Groupon Inc, Palo Alto - [email protected]
>
--
Henning Schmiedehausen - Groupon Engineering Principal
Groupon Inc, Palo Alto - [email protected]
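To make the resend behavior reported in this thread easy to reproduce, here is a hedged sketch (assuming the Mesos 0.22.0 Java bindings with explicit acknowledgements enabled; the class name and bookkeeping are illustrative, not the actual jesos or Singularity code) that withholds the ack for a task's terminal update and logs whatever the slave resends. With the bug described above, the resend arrives as TASK_RUNNING rather than TASK_KILLED:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.mesos.Protos.TaskState;
    import org.apache.mesos.Protos.TaskStatus;
    import org.apache.mesos.Scheduler;
    import org.apache.mesos.SchedulerDriver;

    // Sketch of the status-update path only; the remaining Scheduler callbacks
    // are omitted, hence the class is abstract.
    public abstract class ResendReproScheduler implements Scheduler {
        private final Map<String, TaskState> lastSeen = new ConcurrentHashMap<>();

        // Terminal states as of 0.22.0; later releases add further ones.
        private static boolean isTerminal(TaskState state) {
            return state == TaskState.TASK_FINISHED
                || state == TaskState.TASK_FAILED
                || state == TaskState.TASK_KILLED
                || state == TaskState.TASK_LOST;
        }

        @Override
        public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
            String taskId = status.getTaskId().getValue();
            TaskState previous = lastSeen.get(taskId);

            if (previous != null && isTerminal(previous)) {
                // The slave is retrying because the terminal update was never
                // acknowledged. A terminal state must not be followed by a
                // non-terminal one, yet per this thread the retry carries
                // TASK_RUNNING instead of TASK_KILLED.
                System.err.println("Resend for " + taskId + ": previously "
                    + previous + ", now " + status.getState());
                return;
            }

            lastSeen.put(taskId, status.getState());

            if (isTerminal(status.getState())) {
                // Deliberately skip the ack for the terminal update, simulating
                // a lost acknowledgement, so the slave's status update manager
                // retries after its timeout.
                return;
            }

            driver.acknowledgeStatusUpdate(status);
        }
    }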
