Ah, I guess you're using a custom language binding that speaks the internal protobuf protocol? 'latest_state' is an internal message field that is intentionally hidden from the scheduler API; please don't use it.
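To spell out what "correctly acks the messages" means in practice: until the framework acknowledges the executor-sourced TASK_RUNNING, the slave keeps resending it and holds back the queued TASK_KILLED. With the stock org.apache.mesos Java bindings (0.22.x) and explicit acknowledgements, that looks roughly like the sketch below. The scheduler class here is a made-up placeholder; only the driver calls are from the bindings' API. A custom binding like jesos would do the equivalent over the internal protocol.

    import org.apache.mesos.MesosSchedulerDriver;
    import org.apache.mesos.Protos.FrameworkInfo;
    import org.apache.mesos.Protos.TaskStatus;
    import org.apache.mesos.Scheduler;
    import org.apache.mesos.SchedulerDriver;

    // Sketch only: "AckingScheduler" is a placeholder, not jesos code.
    public abstract class AckingScheduler implements Scheduler {

        // Create a driver with implicit acknowledgements turned off
        // (4-argument constructor, available since 0.22.0), so the
        // statusUpdate() callback below is responsible for acking.
        public static SchedulerDriver newDriver(Scheduler scheduler,
                                                FrameworkInfo framework,
                                                String master) {
            boolean implicitAcknowledgements = false;
            return new MesosSchedulerDriver(scheduler, framework, master,
                                            implicitAcknowledgements);
        }

        @Override
        public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
            // ... update the framework's own task state machine first ...

            // Then ack the update. Until this arrives at the slave, the
            // status update manager keeps resending TASK_RUNNING and does
            // not forward the queued TASK_KILLED.
            if (status.hasUuid()) {
                driver.acknowledgeStatusUpdate(status);
            }
        }
    }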
On Tue, Mar 31, 2015 at 3:30 PM, Henning Schmiedehausen <[email protected]> wrote:

> Hm,
>
> so after replaying some of my logs, I am not entirely sure. This may
> be a design decision in the mesos core. Maybe you can help me out
> here: If an event is not acked, it gets repeated. This is what happens
> right now. If an event never gets acked (e.g. the first TASK_RUNNING),
> what will happen if a task falls into "TASK_KILLED"? Will the slave
> still try to deliver the TASK_RUNNING first (and then the TASK_KILLED
> once that got acked)? Or would it discard the older (TASK_RUNNING)
> message, which is now superseded?
>
> The messages that I see are:
>
> DEBUG [2015-03-31 19:57:48,971]
> com.groupon.mesos.util.HttpProtocolReceiver: Received from
> [email protected]:5050: update {
>   framework_id {
>     value: "Singularity_local"
>   }
>   executor_id {
>     value: "MyFirstDeploy-d2913ee6-1427831718108-1-blackbox.group.on-LOCAL.R1.1"
>   }
>   slave_id {
>     value: "20150331-113503-2156094730-5050-13210-S0"
>   }
>   status {
>     task_id {
>       value: "MyFirstDeploy-d2913ee6-1427831718108-1-blackbox.group.on-LOCAL.R1.1"
>     }
>     state: TASK_RUNNING
>     slave_id {
>       value: "20150331-113503-2156094730-5050-13210-S0"
>     }
>     timestamp: 1.427831718964869E9
>     source: SOURCE_EXECUTOR
>   }
>   timestamp: 1.427831718964869E9
>   uuid: "\346>H\317\202\361G\022\265\001)\251\037\210\256\234"
>   latest_state: TASK_KILLED
> }
>
> which implies that this is a resend of the original "TASK_RUNNING"
> message, but the "latest_state" is the latest state in time (while the
> TASK_RUNNING is still sitting in the message queue).
>
> So, if that is the case, then there actually is no bug in Mesos. Once
> the framework correctly acks the messages, they are delivered
> correctly.
>
> -h
>
>
>
> On Tue, Mar 31, 2015 at 11:28 AM, Benjamin Mahler
> <[email protected]> wrote:
> > From the slave logs you posted here:
> > https://gist.github.com/hgschmie/fc20b361a2cadeba0fbd
> >
> > The slave received updates for RUNNING followed by KILLED from the executor.
> > The slave was only forwarding RUNNING to the master as it didn't receive an
> > ack from the framework. Why do you think that KILLED was being forwarded by
> > the slave? I don't see that in these logs (note the "Forwarding" lines).
> >
> > On Tue, Mar 31, 2015 at 11:20 AM, Niklas Nielsen <[email protected]>
> > wrote:
> >>
> >> Hi Henning,
> >>
> >> It does sound suspicious.
> >> Would you mind capturing this in a JIRA ticket (targeted 0.22.1) so we can
> >> track progress and assign an owner?
> >>
> >> Thanks,
> >> Niklas
> >>
> >> On 31 March 2015 at 10:00, Henning Schmiedehausen <[email protected]> wrote:
> >>
> >> > Ok, so here is my log file from the slave:
> >> > https://gist.github.com/hgschmie/fc20b361a2cadeba0fbd
> >> >
> >> > - Slave shuts down executor:
> >> >
> >> > I0330 16:27:59.187947 18428 slave.cpp:1558] Asked to kill task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local
> >> > I0330 16:27:59.239296 18425 slave.cpp:2508] Handling status update
> >> > TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local from executor(1)@10.101.131.128:55377
> >> > I0330 16:27:59.239320 18425 slave.cpp:4468] Terminating task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
> >> > I0330 16:27:59.240444 18424 status_update_manager.cpp:317] Received
> >> > status update TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396)
> >> > for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
> >> > of framework Singularity_local
> >> > I0330 16:27:59.240613 18425 slave.cpp:2680] Status update manager
> >> > successfully handled status update TASK_KILLED (UUID:
> >> > 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local
> >> > I0330 16:27:59.240623 18425 slave.cpp:2686] Sending acknowledgement
> >> > for status update TASK_KILLED (UUID:
> >> > 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local to executor(1)@10.101.131.128:55377
> >> > I0330 16:28:00.443361 18422 slave.cpp:3193] Executor
> >> > 'MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT' of
> >> > framework Singularity_local exited with status 0
> >> >
> >> > Now my framework receives the TASK_KILLED (I see that) but fails to
> >> > send the ack message back to mesos.
> >> >
> >> > After 30 seconds:
> >> >
> >> > W0330 16:28:30.097679 18426 status_update_manager.cpp:472] Resending
> >> > status update TASK_RUNNING (UUID:
> >> > ea431985-98de-4447-8668-fda26c95f040) for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local
> >> > I0330 16:28:30.097704 18426 status_update_manager.cpp:371] Forwarding
> >> > update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for
> >> > task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
> >> > of framework Singularity_local to the slave
> >> > I0330 16:28:30.097761 18426 slave.cpp:2753] Forwarding the update
> >> > TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local to [email protected]:5050
> >> >
> >> > After another 20 seconds:
> >> >
> >> > W0330 16:29:50.098641 18425 status_update_manager.cpp:472] Resending
> >> > status update TASK_RUNNING (UUID:
> >> > ea431985-98de-4447-8668-fda26c95f040) for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local
> >> > I0330 16:29:50.098659 18425 status_update_manager.cpp:371] Forwarding
> >> > update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for
> >> > task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
> >> > of framework Singularity_local to the slave
> >> > I0330 16:29:50.098737 18422 slave.cpp:2753] Forwarding the update
> >> > TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task
> >> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> >> > framework Singularity_local to [email protected]:5050
> >> >
> >> > So for my task
> >> > "MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT",
> >> > which is finished (the executor has exited), the slave is now
> >> > resending "TASK_RUNNING". And that cannot be right. This should be
> >> > "TASK_KILLED".
> >> >
> >> > As a result, the state machine in my framework is now confused,
> >> > because a task just transitioned from TASK_KILLED to TASK_RUNNING,
> >> > which should not happen, because TASK_KILLED is a terminal state.
> >> >
> >> > So I guess that a bug in my framework code has exposed a bug in the
> >> > mesos slave state resending code.
> >> >
> >> > -h
> >> >
> >> > (BTW, I am a bit concerned that these discussions do not happen on the
> >> > mesos-dev list. I added it in Cc. This is still an Apache project,
> >> > right? (Wears his Apache hat)).
> >> >
> >> > On Mon, Mar 30, 2015 at 9:13 PM, Henning Schmiedehausen
> >> > <[email protected]> wrote:
> >> > > Hi Ben,
> >> > >
> >> > > quick question about a comment in the mesos source code:
> >> > >
> >> > > https://github.com/apache/mesos/blob/master/src/sched/sched.cpp#L695-L703
> >> > >
> >> > > // If the update is driver-generated or master-generated, it
> >> > > // does not require acknowledgement and so we unset the 'uuid'
> >> > > // field of TaskStatus. Otherwise, we overwrite the field to
> >> > > // ensure that a 0.22.0 scheduler driver supports explicit
> >> > > // acknowledgements, even if running against a 0.21.0 cluster.
> >> > > //
> >> > > // TODO(bmahler): Update the slave / executor driver to ensure
> >> > > // that 'uuid' is set accurately by the time it reaches the
> >> > > // scheduler driver. This will be required for pure bindings.
> >> > > if (from == UPID() || pid == UPID()) {
> >> > >
> >> > > No matter how much I twist and turn this piece of code, I interpret
> >> > > it as "if the sender of the message or the pid from the message
> >> > > itself is the local address". Why would that skip master-generated
> >> > > updates?
> >> > >
> >> > > In fact, my problems seem to be related to a misinterpretation of this
> >> > > statement in my java driver (jesos, https://github.com/groupon/jesos).
> >> > > However, I believe that this uncovered an actual bug in the slave
> >> > > behavior:
> >> > >
> >> > > - Executor exits, Slave sends TaskStatus "TASK_KILLED" message to master
> >> > > - Master sends the message to the Framework.
> >> > > - Framework code fails to acknowledge the message (which is a
> >> > >   legitimate error case, because messages can get lost)
> >> > > - After a timeout, the Slave resends the TaskStatus message; however,
> >> > >   it does so with state TASK_RUNNING, not TASK_KILLED.
> >> > >
> >> > > I could verify that behavior in 0.22.0 and the current trunk. I have a
> >> > > bit more debugging in a log file in the office, I will send that to
> >> > > you tomorrow.
> >> > >
> >> > > -h
> >> > >
> >> > >
> >> > >
> >> > > On Mon, Mar 30, 2015 at 7:45 PM, Benjamin Hindman <[email protected]> wrote:
> >> > >> Hey Henning!
> >> > >>
> >> > >> Someone mentioned today that you were discussing a potential bug in 0.22.0
> >> > >> on IRC? Can you elaborate? We are taking this very seriously and would like
> >> > >> to make sure any issues are dealt with promptly.
> >> > >>
> >> > >> Ben.
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Henning Schmiedehausen - Groupon Engineering Principal
> >> > > Groupon Inc, Palo Alto - [email protected]
> >> >
> >> >
> >> >
> >> > --
> >> > Henning Schmiedehausen - Groupon Engineering Principal
> >> > Groupon Inc, Palo Alto - [email protected]
> >> >
>
>
>
> --
> Henning Schmiedehausen - Groupon Engineering Principal
> Groupon Inc, Palo Alto - [email protected]
>
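Re the sched.cpp comment quoted above: that is essentially the contract a pure binding has to implement. Only executor-generated updates carry a 'uuid' in TaskStatus and need an acknowledgement; driver- and master-generated updates do not and must not be acked. A rough sketch of that check in Java follows; the helper class is hypothetical, only the TaskStatus accessors come from the protobuf-generated classes.

    import org.apache.mesos.Protos.TaskStatus;

    // Hypothetical helper for a custom binding such as jesos: decide whether
    // a status update needs an explicit acknowledgement, mirroring the
    // intent of the sched.cpp comment quoted above.
    final class AckPolicy {

        private AckPolicy() {}

        static boolean requiresAcknowledgement(TaskStatus status) {
            // Driver- and master-generated updates arrive with 'uuid' unset;
            // only updates that originated from an executor (and were
            // forwarded by the slave's status update manager) carry one.
            return status.hasUuid() && !status.getUuid().isEmpty();
        }
    }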
