Ah, I guess you're using a custom language binding that speaks the internal protobuf protocol? 'latest_state' is an internal message field that is intentionally hidden from the scheduler API; please don't use it.
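To spell out what "correctly acks the messages" means in practice: until the framework acknowledges the executor-sourced TASK_RUNNING, the slave keeps resending it and holds back the queued TASK_KILLED. With the stock org.apache.mesos Java bindings (0.22.x) and explicit acknowledgements, that looks roughly like the sketch below. The scheduler class here is a made-up placeholder; only the driver calls are from the bindings' API. A custom binding like jesos would do the equivalent over the internal protocol.

    import org.apache.mesos.MesosSchedulerDriver;
    import org.apache.mesos.Protos.FrameworkInfo;
    import org.apache.mesos.Protos.TaskStatus;
    import org.apache.mesos.Scheduler;
    import org.apache.mesos.SchedulerDriver;

    // Sketch only: "AckingScheduler" is a placeholder, not jesos code.
    public abstract class AckingScheduler implements Scheduler {

        // Create a driver with implicit acknowledgements turned off
        // (4-argument constructor, available since 0.22.0), so the
        // statusUpdate() callback below is responsible for acking.
        public static SchedulerDriver newDriver(Scheduler scheduler,
                                                FrameworkInfo framework,
                                                String master) {
            boolean implicitAcknowledgements = false;
            return new MesosSchedulerDriver(scheduler, framework, master,
                                            implicitAcknowledgements);
        }

        @Override
        public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
            // ... update the framework's own task state machine first ...

            // Then ack the update. Until this arrives at the slave, the
            // status update manager keeps resending TASK_RUNNING and does
            // not forward the queued TASK_KILLED.
            if (status.hasUuid()) {
                driver.acknowledgeStatusUpdate(status);
            }
        }
    }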
On Tue, Mar 31, 2015 at 3:30 PM, Henning Schmiedehausen <[email protected]> wrote:

> Hm,
>
> so after replaying some of my logs, I am not entirely sure. This may
> be a design decision in the mesos core. Maybe you can help me out
> here: If an event is not acked, it gets repeated. This is what happens
> right now. If an event never gets acked (e.g. the first TASK_RUNNING),
> what will happen if a task falls into "TASK_KILLED"? Will the slave
> still try to deliver the TASK_RUNNING first (and then the TASK_KILLED
> once that got acked)? Or would it discard the older (TASK_RUNNING)
> message, which is now superseded?
>
> The messages that I see are:
>
> DEBUG [2015-03-31 19:57:48,971]
> com.groupon.mesos.util.HttpProtocolReceiver: Received from
> [email protected]:5050: update {
>   framework_id {
>     value: "Singularity_local"
>   }
>   executor_id {
>     value: "MyFirstDeploy-d2913ee6-1427831718108-1-blackbox.group.on-LOCAL.R1.1"
>   }
>   slave_id {
>     value: "20150331-113503-2156094730-5050-13210-S0"
>   }
>   status {
>     task_id {
>       value: "MyFirstDeploy-d2913ee6-1427831718108-1-blackbox.group.on-LOCAL.R1.1"
>     }
>     state: TASK_RUNNING
>     slave_id {
>       value: "20150331-113503-2156094730-5050-13210-S0"
>     }
>     timestamp: 1.427831718964869E9
>     source: SOURCE_EXECUTOR
>   }
>   timestamp: 1.427831718964869E9
>   uuid: "\346>H\317\202\361G\022\265\001)\251\037\210\256\234"
>   latest_state: TASK_KILLED
> }
>
> which implies that this is a resend of the original "TASK_RUNNING"
> message, but the "latest_state" is the latest state in time (while the
> TASK_RUNNING is still sitting in the message queue).
>
> So, if that is the case, then there actually is no bug in Mesos. Once
> the framework correctly acks the messages, they are delivered
> correctly.
>
> -h
>
>
>
> On Tue, Mar 31, 2015 at 11:28 AM, Benjamin Mahler
> <[email protected]> wrote:
> > From the slave logs you posted here:
> > https://gist.github.com/hgschmie/fc20b361a2cadeba0fbd
> >
> > The slave received updates for RUNNING followed by KILLED from the executor.
> > The slave was only forwarding RUNNING to the master as it didn't receive an
> > ack from the framework. Why do you think that KILLED was being forwarded by
> > the slave? I don't see that in these logs (note the "Forwarding" lines).
> >
> > On Tue, Mar 31, 2015 at 11:20 AM, Niklas Nielsen <[email protected]>
> > wrote:
> >>
> >> Hi Henning,
> >>
> >> It does sound suspicious.
> >> Would you mind capturing this in a JIRA ticket (targeted 0.22.1) so we can
> >> track progress and assign an owner?
> >>
> >> Thanks,
> >> Niklas
> >>
> >> On 31 March 2015 at 10:00, Henning Schmiedehausen <[email protected]> wrote:
> >>
> >> > Ok, so here is my log file from the slave:
> >> > https://gist.github.com/hgschmie/fc20b361a2cadeba0fbd
> >> >
> >> > - Slave shuts down executor:
> >> >
> >> > I0330 16:27:59.187947 18428 slave.cpp:1558] Asked to kill task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local
> >> > I0330 16:27:59.239296 18425 slave.cpp:2508] Handling status update
> >> > TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local from executor(1)@10.101.131.128:55377
> >> > I0330 16:27:59.239320 18425 slave.cpp:4468] Terminating task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
> >> > I0330 16:27:59.240444 18424 status_update_manager.cpp:317] Received
> >> > status update TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396)
> >> > for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
> >> > of framework Singularity_local
> >> > I0330 16:27:59.240613 18425 slave.cpp:2680] Status update manager
> >> > successfully handled status update TASK_KILLED (UUID:
> >> > 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local
> >> > I0330 16:27:59.240623 18425 slave.cpp:2686] Sending acknowledgement
> >> > for status update TASK_KILLED (UUID:
> >> > 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local to executor(1)@10.101.131.128:55377
> >> > I0330 16:28:00.443361 18422 slave.cpp:3193] Executor
> >> > 'MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT' of
> >> > framework Singularity_local exited with status 0
> >> >
> >> > Now my framework receives the TASK_KILLED (I see that) but fails to
> >> > send the ack message back to mesos.
> >> >
> >> > After 30 seconds:
> >> >
> >> > W0330 16:28:30.097679 18426 status_update_manager.cpp:472] Resending
> >> > status update TASK_RUNNING (UUID:
> >> > ea431985-98de-4447-8668-fda26c95f040) for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local
> >> > I0330 16:28:30.097704 18426 status_update_manager.cpp:371] Forwarding
> >> > update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for
> >> > task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
> >> > of framework Singularity_local to the slave
> >> > I0330 16:28:30.097761 18426 slave.cpp:2753] Forwarding the update
> >> > TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local to [email protected]:5050
> >> >
> >> > After another 20 seconds:
> >> >
> >> > W0330 16:29:50.098641 18425 status_update_manager.cpp:472] Resending
> >> > status update TASK_RUNNING (UUID:
> >> > ea431985-98de-4447-8668-fda26c95f040) for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local
> >> > I0330 16:29:50.098659 18425 status_update_manager.cpp:371] Forwarding
> >> > update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for
> >> > task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
> >> > of framework Singularity_local to the slave
> >> > I0330 16:29:50.098737 18422 slave.cpp:2753] Forwarding the update
> >> > TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local to [email protected]:5050
> >> >
> >> > So for my task
> >> > "MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT",
> >> > which is finished (the executor has exited), the slave is now
> >> > resending "TASK_RUNNING". And that cannot be right. This should be
> >> > "TASK_KILLED".
> >> >
> >> > As a result, the state machine in my framework is now confused,
> >> > because a task just transitioned from TASK_KILLED to TASK_RUNNING,
> >> > which should not happen, because TASK_KILLED is a terminal state.
> >> >
> >> > So I guess that a bug in my framework code has exposed a bug in the
> >> > mesos slave state resending code.
> >> >
> >> > -h
> >> >
> >> > (BTW, I am a bit concerned that these discussions do not happen on the
> >> > mesos-dev list. I added it in Cc. This is still an Apache project,
> >> > right? (Wears his Apache hat)).
> >> >
> >> > On Mon, Mar 30, 2015 at 9:13 PM, Henning Schmiedehausen
> >> > <[email protected]> wrote:
> >> > > Hi Ben,
> >> > >
> >> > > quick question about a comment in the mesos source code:
> >> > >
> >> > > https://github.com/apache/mesos/blob/master/src/sched/sched.cpp#L695-L703
> >> > >
> >> > > // If the update is driver-generated or master-generated, it
> >> > > // does not require acknowledgement and so we unset the 'uuid'
> >> > > // field of TaskStatus. Otherwise, we overwrite the field to
> >> > > // ensure that a 0.22.0 scheduler driver supports explicit
> >> > > // acknowledgements, even if running against a 0.21.0 cluster.
> >> > > //
> >> > > // TODO(bmahler): Update the slave / executor driver to ensure
> >> > > // that 'uuid' is set accurately by the time it reaches the
> >> > > // scheduler driver. This will be required for pure bindings.
> >> > > if (from == UPID() || pid == UPID()) {
> >> > >
> >> > > No matter how much I twist and turn this piece of code, I interpret
> >> > > it as "if the sender of the message or the pid from the message
> >> > > itself is the local address". Why would that skip master-generated
> >> > > updates?
> >> > >
> >> > > In fact, my problems seem to be related to a misinterpretation of this
> >> > > statement in my java driver (jesos, https://github.com/groupon/jesos).
> >> > > However, I believe that this uncovered an actual bug in the slave
> >> > > behavior:
> >> > >
> >> > > - Executor exits, Slave sends TaskStatus "TASK_KILLED" message to master
> >> > > - Master sends the message to the Framework.
> >> > > - Framework code fails to acknowledge the message (which is a
> >> > >   legitimate error case, because messages can get lost)
> >> > > - After a timeout, the Slave resends the TaskStatus message; however,
> >> > >   it does so with state TASK_RUNNING, not TASK_KILLED.
> >> > >
> >> > > I could verify that behavior in 0.22.0 and the current trunk. I have a
> >> > > bit more debugging in a log file in the office, I will send that to
> >> > > you tomorrow.
> >> > >
> >> > > -h
> >> > >
> >> > >
> >> > >
> >> > > On Mon, Mar 30, 2015 at 7:45 PM, Benjamin Hindman <[email protected]> wrote:
> >> > >> Hey Henning!
> >> > >>
> >> > >> Someone mentioned today that you were discussing a potential bug in 0.22.0
> >> > >> on IRC? Can you elaborate? We are taking this very seriously and would like
> >> > >> to make sure any issues are dealt with promptly.
> >> > >>
> >> > >> Ben.
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Henning Schmiedehausen - Groupon Engineering Principal
> >> > > Groupon Inc, Palo Alto - [email protected]
> >> >
> >> >
> >> >
> >> > --
> >> > Henning Schmiedehausen - Groupon Engineering Principal
> >> > Groupon Inc, Palo Alto - [email protected]
> >> >
>
>
>
> --
> Henning Schmiedehausen - Groupon Engineering Principal
> Groupon Inc, Palo Alto - [email protected]
>
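Re the sched.cpp comment quoted above: that is essentially the contract a pure binding has to implement. Only executor-generated updates carry a 'uuid' in TaskStatus and need an acknowledgement; driver- and master-generated updates do not and must not be acked. A rough sketch of that check in Java follows; the helper class is hypothetical, only the TaskStatus accessors come from the protobuf-generated classes.

    import org.apache.mesos.Protos.TaskStatus;

    // Hypothetical helper for a custom binding such as jesos: decide whether
    // a status update needs an explicit acknowledgement, mirroring the
    // intent of the sched.cpp comment quoted above.
    final class AckPolicy {

        private AckPolicy() {}

        static boolean requiresAcknowledgement(TaskStatus status) {
            // Driver- and master-generated updates arrive with 'uuid' unset;
            // only updates that originated from an executor (and were
            // forwarded by the slave's status update manager) carry one.
            return status.hasUuid() && !status.getUuid().isEmpty();
        }
    }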
