Hi Henning,

It does sound suspicious. Would you mind capturing this in a JIRA ticket (targeted at 0.22.1) so we can track progress and assign an owner?
Thanks,
Niklas

On 31 March 2015 at 10:00, Henning Schmiedehausen <[email protected]> wrote:
> Ok, so here is my log file from the slave:
> https://gist.github.com/hgschmie/fc20b361a2cadeba0fbd
>
> - Slave shuts down executor:
>
> I0330 16:27:59.187947 18428 slave.cpp:1558] Asked to kill task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local
> I0330 16:27:59.239296 18425 slave.cpp:2508] Handling status update TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local from executor(1)@10.101.131.128:55377
> I0330 16:27:59.239320 18425 slave.cpp:4468] Terminating task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
> I0330 16:27:59.240444 18424 status_update_manager.cpp:317] Received status update TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local
> I0330 16:27:59.240613 18425 slave.cpp:2680] Status update manager successfully handled status update TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local
> I0330 16:27:59.240623 18425 slave.cpp:2686] Sending acknowledgement for status update TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local to executor(1)@10.101.131.128:55377
> I0330 16:28:00.443361 18422 slave.cpp:3193] Executor 'MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT' of framework Singularity_local exited with status 0
>
> Now my framework receives the TASK_KILLED (I see that) but fails to send the ack message back to mesos.
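For reference, this is a minimal sketch of the acknowledgement the quoted message says never makes it back to the slave. It is not jesos or Henning's framework code; it is a bare scheduler against the stock 0.22-era C++ driver with explicit acknowledgements enabled (the trailing false in the constructor), and the master address and framework name are placeholders. Under that mode, every status update that carries a uuid must be acknowledged via acknowledgeStatusUpdate(), otherwise the slave's status update manager keeps resending.

    // Minimal sketch only; not jesos and not Henning's framework.
    #include <iostream>
    #include <string>
    #include <vector>

    #include <mesos/scheduler.hpp>

    using namespace mesos;

    class AckScheduler : public Scheduler {
    public:
      void registered(SchedulerDriver*, const FrameworkID&, const MasterInfo&) override {}
      void reregistered(SchedulerDriver*, const MasterInfo&) override {}
      void disconnected(SchedulerDriver*) override {}

      void resourceOffers(SchedulerDriver* driver, const std::vector<Offer>& offers) override {
        for (const Offer& offer : offers) {
          driver->declineOffer(offer.id());  // nothing to launch in this sketch
        }
      }

      void offerRescinded(SchedulerDriver*, const OfferID&) override {}

      void statusUpdate(SchedulerDriver* driver, const TaskStatus& status) override {
        std::cout << "Task " << status.task_id().value() << " is now "
                  << TaskState_Name(status.state()) << std::endl;

        // This is the acknowledgement the slave is waiting for. If it is
        // never sent, the status update manager retries, as in the log above.
        if (status.has_uuid()) {
          driver->acknowledgeStatusUpdate(status);
        }
      }

      void frameworkMessage(SchedulerDriver*, const ExecutorID&, const SlaveID&,
                            const std::string&) override {}
      void slaveLost(SchedulerDriver*, const SlaveID&) override {}
      void executorLost(SchedulerDriver*, const ExecutorID&, const SlaveID&, int) override {}
      void error(SchedulerDriver*, const std::string& message) override {
        std::cerr << "Error: " << message << std::endl;
      }
    };

    int main() {
      FrameworkInfo framework;
      framework.set_user("");            // let Mesos fill in the current user
      framework.set_name("ack-example"); // placeholder name

      AckScheduler scheduler;

      // The trailing 'false' opts into explicit acknowledgements (0.22.0+).
      MesosSchedulerDriver driver(&scheduler, framework, "127.0.0.1:5050", false);
      return driver.run() == DRIVER_STOPPED ? 0 : 1;
    }

The jesos driver presumably does the equivalent internally; the sketch only shows where the missing acknowledgement would normally originate.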
> After 30 seconds:
>
> W0330 16:28:30.097679 18426 status_update_manager.cpp:472] Resending status update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local
> I0330 16:28:30.097704 18426 status_update_manager.cpp:371] Forwarding update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local to the slave
> I0330 16:28:30.097761 18426 slave.cpp:2753] Forwarding the update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local to [email protected]:5050
>
> After another 20 seconds:
>
> W0330 16:29:50.098641 18425 status_update_manager.cpp:472] Resending status update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local
> I0330 16:29:50.098659 18425 status_update_manager.cpp:371] Forwarding update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local to the slave
> I0330 16:29:50.098737 18422 slave.cpp:2753] Forwarding the update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of framework Singularity_local to [email protected]:5050
>
> So for my task "MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT", which is finished (the executor has exited), the slave is now resending "TASK_RUNNING". And that can not be right. This should be "TASK_KILLED".
>
> As a result, the state machine in my framework is now confused, because a task just transitioned from TASK_KILLED to TASK_RUNNING. Which should not happen, because TASK_KILLED is a terminal state.
>
> So I guess that a bug in my framework code has exposed a bug in the mesos slave state resending code.
>
> -h
>
> (BTW, I am a bit concerned that these discussions do not happen on the mesos-dev list. I added it in Cc. This is still an Apache project, right? (Wears his Apache hat)).
>
> On Mon, Mar 30, 2015 at 9:13 PM, Henning Schmiedehausen <[email protected]> wrote:
> > Hi Ben,
> >
> > quick question about a comment in the mesos source code:
> > https://github.com/apache/mesos/blob/master/src/sched/sched.cpp#L695-L703
> >
> > // If the update is driver-generated or master-generated, it
> > // does not require acknowledgement and so we unset the 'uuid'
> > // field of TaskStatus. Otherwise, we overwrite the field to
> > // ensure that a 0.22.0 scheduler driver supports explicit
> > // acknowledgements, even if running against a 0.21.0 cluster.
> > //
> > // TODO(bmahler): Update the slave / executor driver to ensure
> > // that 'uuid' is set accurately by the time it reaches the
> > // scheduler driver. This will be required for pure bindings.
> > if (from == UPID() || pid == UPID()) {
> >
> > No matter how much I twist and turn this piece of code, I interpret it as "if the sender of the message or the pid from the message itself is the local address". Why would that skip master-generated updates?
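An editorial aside on the quoted check, offered as a guess rather than an answer: in libprocess, UPID() is the default-constructed, empty pid rather than the local address, so the condition would fire when the message arrived with no sender recorded (from == UPID()) or when the update carries no pid at all (pid == UPID()), which per the comment is how driver- and master-generated updates would look. A tiny illustration of that reading, assuming the libprocess headers are on the include path; the example pid string is made up:

    // Illustration only: UPID() is the "empty" pid, not the local address.
    #include <cassert>
    #include <process/pid.hpp>

    int main() {
      process::UPID from;                                  // no sender recorded
      process::UPID master("master@10.101.131.128:5050");  // a populated pid

      assert(from == process::UPID());        // empty pid matches the check
      assert(!(master == process::UPID()));   // a real sender does not
      return 0;
    }

If that reading is right, the check is keyed on missing pids rather than the local address; whether master-generated updates actually arrive that way is exactly the open question.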
> >
> > In fact, my problems seem to be related to a misinterpretation of this statement in my java driver (jesos, https://github.com/groupon/jesos). However, I believe that this uncovered an actual bug in the slave behavior:
> >
> > - Executor exits, Slave sends TaskStatus "TASK_KILLED" message to master
> > - Master sends the message to the Framework.
> > - Framework code fails to acknowledge the message (which is a legitimate error case, because messages can get lost)
> > - After a timeout, the Slave resends the TaskStatus message; however, it does so with state TASK_RUNNING, not TASK_KILLED.
> >
> > I could verify that behavior in 0.22.0 and the current trunk. I have a bit more debugging in a log file in the office, I will send that to you tomorrow.
> >
> > -h
> >
> >
> > On Mon, Mar 30, 2015 at 7:45 PM, Benjamin Hindman <[email protected]> wrote:
> >> Hey Henning!
> >>
> >> Someone mentioned today that you were discussing a potential bug in 0.22.0 on IRC? Can you elaborate? We are taking this very seriously and would like to make sure any issues are dealt with promptly.
> >>
> >> Ben.
> >
> >
> > --
> > Henning Schmiedehausen - Groupon Engineering Principal
> > Groupon Inc, Palo Alto - [email protected]
>
>
> --
> Henning Schmiedehausen - Groupon Engineering Principal
> Groupon Inc, Palo Alto - [email protected]
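A closing, framework-side note, independent of where the slave-side fix lands: a scheduler can protect its own bookkeeping against the TASK_KILLED to TASK_RUNNING transition described above by treating terminal states as final. The sketch below is a hedged illustration, not jesos code; the helper names are invented and it only assumes the Mesos C++ protobuf headers (<mesos/mesos.hpp>).

    // Hypothetical framework-side guard; names are made up for illustration.
    #include <iostream>
    #include <mesos/mesos.hpp>

    // Terminal task states as of Mesos 0.22.x. (Later versions also define
    // TASK_ERROR, which is terminal as well.)
    static bool isTerminal(mesos::TaskState state) {
      switch (state) {
        case mesos::TASK_FINISHED:
        case mesos::TASK_FAILED:
        case mesos::TASK_KILLED:
        case mesos::TASK_LOST:
          return true;
        default:
          return false;
      }
    }

    // Returns true if the transition should be applied; false means the
    // update is late or duplicated and should only be logged.
    static bool acceptTransition(mesos::TaskState current, mesos::TaskState next) {
      if (isTerminal(current) && next != current) {
        std::cerr << "Ignoring transition " << mesos::TaskState_Name(current)
                  << " -> " << mesos::TaskState_Name(next)
                  << ": task already reached a terminal state" << std::endl;
        return false;
      }
      return true;
    }

    int main() {
      // The sequence from the log above: a resent TASK_RUNNING arriving after
      // TASK_KILLED is rejected by the guard rather than applied.
      bool ok = acceptTransition(mesos::TASK_KILLED, mesos::TASK_RUNNING);
      std::cout << (ok ? "accepted" : "rejected") << std::endl;  // prints "rejected"
      return 0;
    }

This does not address the resend itself; it only keeps a stale TASK_RUNNING from un-killing a task in the framework's state machine.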
