Hm,

so after replaying some of my logs, I am not entirely sure anymore.
This may be a design decision in the Mesos core. Maybe you can help
me out here: if an event is not acked, it gets repeated; that is what
happens right now. If an event never gets acked (e.g. the first
TASK_RUNNING) and the task then falls into "TASK_KILLED", what will
happen? Will the slave still try to deliver the TASK_RUNNING first
(and then the TASK_KILLED once that got acked)? Or would it discard
the older (TASK_RUNNING) message, which is now superseded?
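
My guess is the former, i.e. the slave keeps a per-task FIFO of
unacknowledged updates, keeps resending the head until it is acked,
and only then moves on to the next one. Roughly like this conceptual
sketch (the class and method names are made up; this is not Mesos
code, just my mental model):

import java.util.ArrayDeque;
import java.util.Deque;

// Mental model only: one queue of unacknowledged status updates per task.
class PendingUpdates {
    // Oldest unacknowledged update sits at the head, newer ones wait behind it.
    private final Deque<String> unacked = new ArrayDeque<>();

    void received(String state) {
        unacked.addLast(state);      // e.g. "TASK_RUNNING", later "TASK_KILLED"
    }

    // On every retry timeout only the head is (re)sent ...
    String resend() {
        return unacked.peekFirst();  // stays TASK_RUNNING until it is acked
    }

    // ... annotated with whatever is currently the newest known state.
    String latestState() {
        return unacked.peekLast();   // TASK_KILLED in the dump below
    }

    // An ack releases the head so the next update can be forwarded.
    void acked() {
        unacked.pollFirst();
    }
}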

The messages that I see are:

DEBUG [2015-03-31 19:57:48,971]
com.groupon.mesos.util.HttpProtocolReceiver: Received from
master@10.101.131.128:5050: update {
  framework_id {
    value: "Singularity_local"
  }
  executor_id {
    value: "MyFirstDeploy-d2913ee6-1427831718108-1-blackbox.group.on-LOCAL.R1.1"
  }
  slave_id {
    value: "20150331-113503-2156094730-5050-13210-S0"
  }
  status {
    task_id {
      value: 
"MyFirstDeploy-d2913ee6-1427831718108-1-blackbox.group.on-LOCAL.R1.1"
    }
    state: TASK_RUNNING
    slave_id {
      value: "20150331-113503-2156094730-5050-13210-S0"
    }
    timestamp: 1.427831718964869E9
    source: SOURCE_EXECUTOR
  }
  timestamp: 1.427831718964869E9
  uuid: "\346>H\317\202\361G\022\265\001)\251\037\210\256\234"
  latest_state: TASK_KILLED
}

which implies that this is a resend of the original "TASK_RUNNING"
message, but "latest_state" reflects the latest state in time (while
the TASK_RUNNING update is still sitting in the message queue).
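
For the framework side this means that a stale TASK_RUNNING can
legitimately show up again even after the task has already been seen
as killed, so the state machine should ack it but never let it regress
a task that is already in a terminal state. A minimal guard could look
like this (hypothetical helper, not actual jesos code):

import org.apache.mesos.Protos.TaskState;

final class TerminalStateGuard {
    // Terminal states a framework typically has to handle.
    static boolean isTerminal(TaskState state) {
        switch (state) {
            case TASK_FINISHED:
            case TASK_FAILED:
            case TASK_KILLED:
            case TASK_LOST:
                return true;
            default:
                return false;
        }
    }

    // Returns the state to record: a redelivered non-terminal update for
    // a task that already reached a terminal state is acked but ignored.
    static TaskState next(TaskState current, TaskState reported) {
        return isTerminal(current) ? current : reported;
    }
}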

So, if that is the case, then there actually is no bug in Mesos. Once
the framework correctly acks the messages, they are delivered
correctly.
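
For reference, with the stock 0.22.0 Java bindings (jesos presumably
exposes an equivalent), explicitly acking an update boils down to
something like this sketch, assuming the driver was created with
implicit acknowledgements disabled:

import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

final class StatusUpdateAcks {
    private StatusUpdateAcks() {}

    // Call at the end of Scheduler#statusUpdate when running with explicit
    // acknowledgements; until this happens the slave keeps resending the
    // update and will not forward the next one for the task.
    static void ack(SchedulerDriver driver, TaskStatus status) {
        // Driver- and master-generated updates carry no uuid and must not
        // be acknowledged (see the sched.cpp comment quoted further down).
        if (status.hasUuid()) {
            driver.acknowledgeStatusUpdate(status);
        }
    }
}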

-h



On Tue, Mar 31, 2015 at 11:28 AM, Benjamin Mahler
<benjamin.mah...@gmail.com> wrote:
> From the slave logs you posted here:
> https://gist.github.com/hgschmie/fc20b361a2cadeba0fbd
>
> The slave received updates for RUNNING followed by KILLED from the executor.
> The slave was only forwarding RUNNING to the master as it didn't receive an
> ack from the framework. Why do you think that KILLED was being forwarded by
> the slave? I don't see that in these logs (note the "Forwarding" lines).
>
> On Tue, Mar 31, 2015 at 11:20 AM, Niklas Nielsen <nik...@mesosphere.io>
> wrote:
>>
>> Hi Henning,
>>
>> It does sound suspicious.
>> Would you mind capturing this in a JIRA ticket (targeted 0.22.1) so we can
>> track progress and assign an owner?
>>
>> Thanks,
>> Niklas
>>
>> On 31 March 2015 at 10:00, Henning Schmiedehausen <
>> hschmiedehau...@groupon.com> wrote:
>>
>> > Ok, so here is my log file from the slave:
>> > https://gist.github.com/hgschmie/fc20b361a2cadeba0fbd
>> >
>> > - Slave shuts down executor:
>> >
>> > I0330 16:27:59.187947 18428 slave.cpp:1558] Asked to kill task
>> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> > framework Singularity_local
>> > I0330 16:27:59.239296 18425 slave.cpp:2508] Handling status update
>> > TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
>> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> > framework Singularity_local from executor(1)@10.101.131.128:55377
>> > I0330 16:27:59.239320 18425 slave.cpp:4468] Terminating task
>> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
>> > I0330 16:27:59.240444 18424 status_update_manager.cpp:317] Received
>> > status update TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396)
>> > for task
>> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
>> > of framework Singularity_local
>> > I0330 16:27:59.240613 18425 slave.cpp:2680] Status update manager
>> > successfully handled status update TASK_KILLED (UUID:
>> > 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
>> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> > framework Singularity_local
>> > I0330 16:27:59.240623 18425 slave.cpp:2686] Sending acknowledgement
>> > for status update TASK_KILLED (UUID:
>> > 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
>> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> > framework Singularity_local to executor(1)@10.101.131.128:55377
>> > I0330 16:28:00.443361 18422 slave.cpp:3193] Executor
>> > 'MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT' of
>> > framework Singularity_local exited with status 0
>> >
>> > Now my framework receives the TASK_KILLED (I see that) but fails to
>> > send the ack message back to mesos. After 30 seconds:
>> >
>> > W0330 16:28:30.097679 18426 status_update_manager.cpp:472] Resending
>> > status update TASK_RUNNING (UUID:
>> > ea431985-98de-4447-8668-fda26c95f040) for task
>> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> > framework Singularity_local
>> > I0330 16:28:30.097704 18426 status_update_manager.cpp:371] Forwarding
>> > update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for
>> > task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
>> > of framework Singularity_local to the slave
>> > I0330 16:28:30.097761 18426 slave.cpp:2753] Forwarding the update
>> > TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task
>> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> > framework Singularity_local to master@10.101.131.128:5050
>> >
>> > After another 20 seconds:
>> >
>> > W0330 16:29:50.098641 18425 status_update_manager.cpp:472] Resending
>> > status update TASK_RUNNING (UUID:
>> > ea431985-98de-4447-8668-fda26c95f040) for task
>> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> > framework Singularity_local
>> > I0330 16:29:50.098659 18425 status_update_manager.cpp:371] Forwarding
>> > update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for
>> > task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
>> > of framework Singularity_local to the slave
>> > I0330 16:29:50.098737 18422 slave.cpp:2753] Forwarding the update
>> > TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task
>> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> > framework Singularity_local to master@10.101.131.128:5050
>> >
>> > so for my task "
>> > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT",
>> > which is finished (the executor has exited), the slave is now
>> > resending "TASK_RUNNING". And that can not be right. This should be
>> > "TASK_KILLED".
>> >
>> > As a result, the state machine in my framework is now confused,
>> > because a task just transitioned from TASK_KILLED to TASK_RUNNING,
>> > which should not happen because TASK_KILLED is a terminal state.
>> >
>> > So I guess that a bug in my framework code has exposed a bug in the
>> > mesos slave state resending code.
>> >
>> > -h
>> >
>> > (BTW, I am a bit concerned that these discussions do not happen on the
>> > mesos-dev list. I added it in Cc. This is still an Apache project,
>> > right? (Wears his Apache hat)).
>> >
>> > On Mon, Mar 30, 2015 at 9:13 PM, Henning Schmiedehausen
>> > <hschmiedehau...@groupon.com> wrote:
>> > > Hi Ben,
>> > >
>> > > quick question about a comment in the mesos source code:
>> > >
>> > >
>> > > https://github.com/apache/mesos/blob/master/src/sched/sched.cpp#L695-L703
>> > >
>> > > // If the update is driver-generated or master-generated, it
>> > > // does not require acknowledgement and so we unset the 'uuid'
>> > > // field of TaskStatus. Otherwise, we overwrite the field to
>> > > // ensure that a 0.22.0 scheduler driver supports explicit
>> > > // acknowledgements, even if running against a 0.21.0 cluster.
>> > > //
>> > > // TODO(bmahler): Update the slave / executor driver to ensure
>> > > // that 'uuid' is set accurately by the time it reaches the
>> > > // scheduler driver. This will be required for pure bindings.
>> > > if (from == UPID() || pid == UPID()) {
>> > >
>> > > No matter how much I twist and turn this piece of code, I
>> > > interpret that as "if the sender of the message or the pid from the
>> > > message itself is the local address". Why would that skip
>> > > master-generated updates?
>> > >
>> > > In fact, my problems seem to be related to a misinterpretation of this
>> > > statement in my java driver (jesos, https://github.com/groupon/jesos).
>> > > However I believe that this uncovered an actual bug in the slave
>> > > behavior:
>> > >
>> > > - Executor exits, Slave sends TaskStatus "TASK_KILLED" message to
>> > > master
>> > > - Master sends the message to the Framework.
>> > > - Framework code fails to acknowledge the message (which is a
>> > > legitimate error case, because messages can get lost)
>> > > - After a timeout, the Slave resends the TaskStatus message, however,
>> > > it does so with state TASK_RUNNING, not TASK_KILLED.
>> > >
>> > > I could verify that behavior in 0.22.0 and the current trunk. I have a
>> > > bit more debugging in a log file in the office, i will send that to
>> > > you tomorrow.
>> > >
>> > > -h
>> > >
>> > >
>> > >
>> > > On Mon, Mar 30, 2015 at 7:45 PM, Benjamin Hindman <b...@mesosphere.io>
>> > > wrote:
>> > >> Hey Henning!
>> > >>
>> > >> Someone mentioned today that you were discussing a potential bug in
>> > >> 0.22.0 on IRC? Can you elaborate? We are taking this very seriously
>> > >> and would like to make sure any issues are dealt with promptly.
>> > >>
>> > >> Ben.
>> > >
>> > >
>> > >
>> > > --
>> > > Henning Schmiedehausen - Groupon Engineering Principal
>> > > Groupon Inc, Palo Alto - henn...@groupon.com
>> >
>> >
>> >
>> > --
>> > Henning Schmiedehausen - Groupon Engineering Principal
>> > Groupon Inc, Palo Alto - henn...@groupon.com
>> >
>
>



-- 
Henning Schmiedehausen - Groupon Engineering Principal
Groupon Inc, Palo Alto - henn...@groupon.com
