[ https://issues.apache.org/jira/browse/MESOS-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693204#comment-13693204 ]

Benjamin Mahler commented on MESOS-525:
---------------------------------------

Punting for now as this issue was related to the detector bugs that have since 
been fixed:

There was an instance of an induced timeout:
W0618 23:55:05.160228 15376 detector.cpp:435] Timed out waiting to reconnect to 
ZooKeeper (sessionId=33f2f2c18cc56a0)
This means the slave was not notified of the partition.

The partition was restored a bit later:
I0619 01:43:16.916136 15371 detector.cpp:485] Master detector 
(slave(1)@10.36.78.129:5051)  found 3 registered masters

And previously, the slave did not get notified of this event.

Since then (in 0.13.0, which will be rolling out soon) this issue would not 
have occurred, because we've fixed several of the underlying problems:
1. The slave will be notified of the partition.
2. Upon restoration of the partition, the slave will re-register with the 
master.
3. The master will disallow the removed slave from re-registering, causing the 
slave to roll (shut down) and kill all tasks underneath it (see the sketch 
below).
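
To make point 3 concrete, here is a minimal, self-contained C++ sketch. It is 
not the actual Mesos master code; the names (Master, removedSlaveIds, 
reregisterSlave, sendShutdown) are hypothetical stand-ins for the idea that 
the master remembers which slaves it removed and answers any later 
re-registration attempt with a shutdown order:

    // Hypothetical sketch, not the real Mesos master code: the master
    // remembers which slaves it has removed and, when such a slave tries to
    // re-register after the partition heals, orders it to shut down instead.

    #include <iostream>
    #include <set>
    #include <string>

    struct Master {
      // Slaves removed for failing health checks during a partition.
      std::set<std::string> removedSlaveIds;

      void removeSlave(const std::string& slaveId) {
        // Health checks failed: remove the slave, mark its tasks LOST
        // (elided), and remember its id.
        removedSlaveIds.insert(slaveId);
      }

      // Called when a slave attempts to re-register after a partition.
      void reregisterSlave(const std::string& slaveId) {
        if (removedSlaveIds.count(slaveId) > 0) {
          // The slave's tasks were already marked LOST, so re-admitting it
          // would resurrect tasks the master no longer knows about.
          // Order the slave to shut down instead.
          std::cout << "Refusing re-registration of removed slave "
                    << slaveId << "; sending shutdown" << std::endl;
          sendShutdown(slaveId);
          return;
        }
        std::cout << "Re-registered slave " << slaveId << std::endl;
      }

      void sendShutdown(const std::string& slaveId) {
        // In the real system this would be a message to the slave, which
        // then terminates ("rolls"), killing all executors and tasks.
        (void)slaveId;
      }
    };

    int main() {
      Master master;
      master.removeSlave("slave-1");      // partition: health checks fail
      master.reregisterSlave("slave-1");  // partition heals: slave returns
      return 0;
    }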
                
> Slave should kill tasks when disconnected from the master for longer than the 
> health check timeout.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-525
>                 URL: https://issues.apache.org/jira/browse/MESOS-525
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>
> The following scenario was observed in production at Twitter:
> 1. Task T begins running on a slave at:
> I0618 02:54:38.069694 15362 slave.cpp:830] Status update: task T of framework 
> F is now in state TASK_RUNNING
> 2. Due to a network partition, the slave is removed from the master for 
> failing health checks:
> W0618 23:56:18.063217 28745 master.cpp:1172] Removing slave 
> 201304011727-2230002186-5050-28738-3217 at S:5051 because it has been 
> deactivated
> I0618 23:56:18.068821 28745 master.cpp:1181] Master now considering a slave 
> at S:5051 as inactive
> 3. The task stayed running on the partitioned slave for 6 days, until a user 
> manually killed the process and the executor marked it as finished:
> I0624 20:20:57.565053 15380 slave.cpp:830] Status update: task 
> 1371524058397-ads-adshard-production-153-a4504eb0-384b-4600-b6fe-e080c87bd84e 
> of framework 201104070004-0000002563-0000 is now in state TASK_FINISHED
> There are a few ways to fix this in the slave; they rely on the fact that 
> the master will have marked the tasks as LOST when it removed the slave, 
> after which point we don't want the tasks to continue running.
>   1. Have the slave commit suicide after (<health_check_failure_timeout> + 
> buffer) amount of time disconnected from the master. This only works well 
> when cgroups is in use, to ensure the next run of the slave cleans up 
> properly, and it gets messier with slave recovery.
>   2. A cleaner approach would be to have the slave kill all executors running 
> under it. We most likely want to send TASK_LOST updates for the tasks, 
> although this will mean duplicate updates unless the master handles these 
> correctly. Alternatively, we can avoid sending any updates, but we'll need to 
> guarantee that the master sent them instead (see the sketch below).
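
As a rough illustration of the suicide/kill-on-disconnect idea described in 
options 1 and 2 above, here is a small self-contained C++ sketch. It is not 
the actual Mesos slave code; Slave, onMasterDisconnected, checkDisconnection, 
and killAllExecutors are hypothetical names, and the timeouts are placeholders:

    // Hypothetical sketch, not the real Mesos slave code: the slave records
    // when it lost its connection to the master and, once the health check
    // failure timeout plus a buffer has elapsed, kills all of its executors.
    // The timeout values below are made up for illustration.

    #include <chrono>
    #include <iostream>
    #include <optional>

    using Clock = std::chrono::steady_clock;

    struct Slave {
      // Assumed to mirror the master's health check failure timeout; the
      // buffer ensures the master removes the slave (marking its tasks
      // LOST) before the slave acts.
      std::chrono::seconds healthCheckFailureTimeout{75};
      std::chrono::seconds buffer{30};

      std::optional<Clock::time_point> disconnectedSince;

      void onMasterDisconnected() {
        if (!disconnectedSince) {
          disconnectedSince = Clock::now();
        }
      }

      void onMasterReconnected() {
        disconnectedSince.reset();
      }

      // Would be driven by a periodic timer in a real implementation.
      void checkDisconnection() {
        if (!disconnectedSince) {
          return;
        }
        if (Clock::now() - *disconnectedSince >
            healthCheckFailureTimeout + buffer) {
          killAllExecutors();
        }
      }

      void killAllExecutors() {
        std::cout << "Disconnected too long; killing all executors"
                  << std::endl;
        // Optionally send TASK_LOST updates here, provided the master
        // de-duplicates them against the ones it already sent.
      }
    };

    int main() {
      Slave slave;
      slave.onMasterDisconnected();
      slave.checkDisconnection();  // too early; executors keep running
      return 0;
    }

The buffer exists so the master is guaranteed to have removed the slave and 
marked its tasks as LOST before the slave kills anything.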

