https://issues.apache.org/jira/browse/MESOS-2580

On Tue, Mar 31, 2015 at 11:20 AM, Niklas Nielsen <[email protected]> wrote:
> Hi Henning,
>
> It does sound suspicious.
> Would you mind capturing this in a JIRA ticket (targeted 0.22.1) so we can
> track progress and assign an owner?
>
> Thanks,
> Niklas
>
> On 31 March 2015 at 10:00, Henning Schmiedehausen
> <[email protected]> wrote:
>>
>> Ok, so here is my log file from the slave:
>> https://gist.github.com/hgschmie/fc20b361a2cadeba0fbd
>>
>> - Slave shuts down executor:
>>
>> I0330 16:27:59.187947 18428 slave.cpp:1558] Asked to kill task
>> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> framework Singularity_local
>> I0330 16:27:59.239296 18425 slave.cpp:2508] Handling status update
>> TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
>> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> framework Singularity_local from executor(1)@10.101.131.128:55377
>> I0330 16:27:59.239320 18425 slave.cpp:4468] Terminating task
>> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
>> I0330 16:27:59.240444 18424 status_update_manager.cpp:317] Received
>> status update TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396)
>> for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
>> of framework Singularity_local
>> I0330 16:27:59.240613 18425 slave.cpp:2680] Status update manager
>> successfully handled status update TASK_KILLED (UUID:
>> 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
>> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> framework Singularity_local
>> I0330 16:27:59.240623 18425 slave.cpp:2686] Sending acknowledgement
>> for status update TASK_KILLED (UUID:
>> 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
>> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> framework Singularity_local to executor(1)@10.101.131.128:55377
>> I0330 16:28:00.443361 18422 slave.cpp:3193] Executor
>> 'MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT' of
>> framework Singularity_local exited with status 0
>>
>> Now my framework receives the TASK_KILLED (I see that) but fails to
>> send the acknowledgement back to Mesos. After 30 seconds:
>>
>> W0330 16:28:30.097679 18426 status_update_manager.cpp:472] Resending
>> status update TASK_RUNNING (UUID:
>> ea431985-98de-4447-8668-fda26c95f040) for task
>> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> framework Singularity_local
>> I0330 16:28:30.097704 18426 status_update_manager.cpp:371] Forwarding
>> update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for
>> task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
>> of framework Singularity_local to the slave
>> I0330 16:28:30.097761 18426 slave.cpp:2753] Forwarding the update
>> TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task
>> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> framework Singularity_local to [email protected]:5050
>>
>> After another 80 seconds:
>>
>> W0330 16:29:50.098641 18425 status_update_manager.cpp:472] Resending
>> status update TASK_RUNNING (UUID:
>> ea431985-98de-4447-8668-fda26c95f040) for task
>> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> framework Singularity_local
>> I0330 16:29:50.098659 18425 status_update_manager.cpp:371] Forwarding
>> update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for
>> task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
>> of framework Singularity_local to the slave
>> I0330 16:29:50.098737 18422 slave.cpp:2753] Forwarding the update
>> TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task
>> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
>> framework Singularity_local to [email protected]:5050
>>
>> So for my task
>> "MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT",
>> which is finished (the executor has exited), the slave is now
>> resending "TASK_RUNNING". That cannot be right; it should be
>> "TASK_KILLED".
>>
>> As a result, the state machine in my framework is now confused,
>> because a task just transitioned from TASK_KILLED to TASK_RUNNING,
>> which should not happen, because TASK_KILLED is a terminal state.
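>>
>> (A framework can at least defend itself against this with a
>> terminal-state guard. A minimal Java sketch; the helper names are my
>> invention, not actual jesos code:)
>>
>>   import org.apache.mesos.Protos.TaskState;
>>   import org.apache.mesos.Protos.TaskStatus;
>>
>>   // The usual terminal states (see TaskState in mesos.proto); once
>>   // a task reaches one of these, no further transition is valid.
>>   static boolean isTerminal(TaskState state) {
>>       switch (state) {
>>           case TASK_FINISHED:
>>           case TASK_FAILED:
>>           case TASK_KILLED:
>>           case TASK_LOST:
>>               return true;
>>           default:
>>               return false;
>>       }
>>   }
>>
>>   // A resent update that contradicts a recorded terminal state (like
>>   // the TASK_RUNNING above) should be acknowledged but not applied.
>>   static boolean shouldApply(TaskState current, TaskStatus update) {
>>       return !isTerminal(current) || update.getState() == current;
>>   }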
>>
>> So I guess that a bug in my framework code has exposed a bug in the
>> Mesos slave's status update resending code.
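>>
>> For reference, with the explicit-acknowledgement mode introduced in
>> 0.22.0 (a scheduler driver constructed with implicitAcknowledgements
>> set to false), I understand the framework-side ack path to look
>> roughly like the following. This is a sketch of mine, not jesos code:
>>
>>   import org.apache.mesos.Protos.TaskStatus;
>>   import org.apache.mesos.SchedulerDriver;
>>
>>   // Inside the Scheduler implementation:
>>   @Override
>>   public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
>>       // ... apply the update to the framework's task bookkeeping ...
>>
>>       // With implicit acknowledgements disabled, every update that
>>       // carries a uuid must be acknowledged explicitly; otherwise
>>       // the slave's status update manager keeps resending it, as in
>>       // the logs above.
>>       if (!status.getUuid().isEmpty()) {
>>           driver.acknowledgeStatusUpdate(status);
>>       }
>>   }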
>>
>> -h
>>
>> (BTW, I am a bit concerned that these discussions do not happen on the
>> mesos-dev list. I added it in Cc. This is still an Apache project,
>> right? (Wears his Apache hat)).
>>
>> On Mon, Mar 30, 2015 at 9:13 PM, Henning Schmiedehausen
>> <[email protected]> wrote:
>> > Hi Ben,
>> >
>> > quick question about a comment in the mesos source code:
>> >
>> > https://github.com/apache/mesos/blob/master/src/sched/sched.cpp#L695-L703
>> >
>> > // If the update is driver-generated or master-generated, it
>> > // does not require acknowledgement and so we unset the 'uuid'
>> > // field of TaskStatus. Otherwise, we overwrite the field to
>> > // ensure that a 0.22.0 scheduler driver supports explicit
>> > // acknowledgements, even if running against a 0.21.0 cluster.
>> > //
>> > // TODO(bmahler): Update the slave / executor driver to ensure
>> > // that 'uuid' is set accurately by the time it reaches the
>> > // scheduler driver. This will be required for pure bindings.
>> > if (from == UPID() || pid == UPID()) {
>> >
>> > No matter how much I twist and turn this piece of code, I can only
>> > interpret it as "if the sender of the message, or the pid from the
>> > message itself, is the local address". Why would that skip
>> > master-generated updates?
>> >
>> > In fact, my problems seem to be related to a misinterpretation of
>> > this statement in my Java driver (jesos,
>> > https://github.com/groupon/jesos). However, I believe that this
>> > uncovered an actual bug in the slave behavior:
>> >
>> > - The executor exits; the slave sends a TaskStatus "TASK_KILLED"
>> > message to the master.
>> > - The master forwards the message to the framework.
>> > - The framework code fails to acknowledge the message (which is a
>> > legitimate error case, because messages can get lost).
>> > - After a timeout, the slave resends the TaskStatus message; however,
>> > it does so with state TASK_RUNNING, not TASK_KILLED (see the
>> > duplicate-handling sketch below).
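>> >
>> > (Since acknowledgements can get lost, a framework has to expect
>> > exact duplicates of an update on resend. A minimal sketch of
>> > deduplicating by uuid inside the scheduler, assuming the Java
>> > bindings; the field and method names are mine:)
>> >
>> >   import com.google.protobuf.ByteString;
>> >   import java.util.HashSet;
>> >   import java.util.Set;
>> >   import org.apache.mesos.Protos.TaskStatus;
>> >
>> >   // uuids of updates that were already applied; a resent update
>> >   // carries the same uuid as the original, so a failed add()
>> >   // flags a duplicate that should be re-acked but not re-applied.
>> >   private final Set<ByteString> appliedUpdates = new HashSet<ByteString>();
>> >
>> >   boolean isDuplicate(TaskStatus status) {
>> >       return !appliedUpdates.add(status.getUuid());
>> >   }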
>> >
>> > I could verify that behavior on 0.22.0 and the current trunk. I have
>> > a bit more debugging output in a log file at the office; I will send
>> > that to you tomorrow.
>> >
>> > -h
>> >
>> >
>> >
>> > On Mon, Mar 30, 2015 at 7:45 PM, Benjamin Hindman <[email protected]>
>> > wrote:
>> >> Hey Henning!
>> >>
>> >> Someone mentioned today that you were discussing a potential bug in
>> >> 0.22.0
>> >> on IRC? Can you elaborate? We are taking this very seriously and would
>> >> like
>> >> to make sure any issues are dealt with promptly.
>> >>
>> >> Ben.
>> >
>> >
>> >
>> > --
>> > Henning Schmiedehausen - Groupon Engineering Principal
>> > Groupon Inc, Palo Alto - [email protected]
>>
>>
>>
>> --
>> Henning Schmiedehausen - Groupon Engineering Principal
>> Groupon Inc, Palo Alto - [email protected]
>
>



-- 
Henning Schmiedehausen - Groupon Engineering Principal
Groupon Inc, Palo Alto - [email protected]
