[ https://issues.apache.org/jira/browse/MESOS-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13693181#comment-13693181 ]

Benjamin Mahler commented on MESOS-525:
---------------------------------------

As [~wfarner] pointed out, we need to be careful here to not cause a 
cluster-wide outage, say, if the master is out of commission for several 
minutes.
                
> Slave should kill tasks when disconnected from the master for longer than the 
> health check timeout.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-525
>                 URL: https://issues.apache.org/jira/browse/MESOS-525
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>
> The following scenario was observed in production at Twitter:
> 1. Task T begins running on a slave at:
> I0618 02:54:38.069694 15362 slave.cpp:830] Status update: task T of framework 
> F is now in state TASK_RUNNING
> 2. Due to a network partition, the slave is removed from the master for 
> failing health checks:
> W0618 23:56:18.063217 28745 master.cpp:1172] Removing slave 
> 201304011727-2230002186-5050-28738-3217 at S:5051 because it has been 
> deactivated
> I0618 23:56:18.068821 28745 master.cpp:1181] Master now considering a slave 
> at S:5051 as inactive
> 3. The task stayed running on the partitioned slave for 6 days! Until a user 
> manually killed the process and the executor marked it as finished:
> I0624 20:20:57.565053 15380 slave.cpp:830] Status update: task 
> 1371524058397-ads-adshard-production-153-a4504eb0-384b-4600-b6fe-e080c87bd84e 
> of framework 201104070004-0000002563-0000 is now in state TASK_FINISHED
> There are a few ways to fix this in the slave; these rely on the fact that 
> the master will have marked the tasks as LOST when it removed the slave, 
> after which point we don't want the tasks to continue running.
>   1. Have the slave commit suicide after (<health_check_failure_timeout> + 
> buffer) amount of time disconnected from the master. This only works well 
> when cgroups are in use, so that the next run of the slave cleans up 
> properly. And this gets messier with slave recovery.
>   2. A cleaner approach would be to have the slave kill all executors running 
> under it. We most likely want to send TASK_LOST updates for the tasks, 
> although this will mean duplicate updates unless the master handles these 
> correctly. Alternatively, we can avoid sending any updates, but we'll need to 
> guarantee that the updates were sent by the master.
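The timeout check behind option 2 could be sketched as below. This is an illustrative sketch only, not actual Mesos slave code: the function name, parameter names, and example durations are all assumptions. The buffer term is what addresses the concern above about a short master outage causing a cluster-wide kill.

```cpp
#include <chrono>

using std::chrono::seconds;

// Hypothetical helper (not the real slave implementation): once the
// slave has been disconnected from the master for longer than the
// master's health check timeout plus a safety buffer, the master has
// already marked its tasks LOST, so the slave should kill its executors.
// The buffer keeps a brief master outage (say, a restart lasting a few
// minutes) from triggering kills across the whole cluster.
bool shouldKillExecutors(seconds disconnectedFor,
                         seconds healthCheckTimeout,
                         seconds buffer)
{
  return disconnectedFor > healthCheckTimeout + buffer;
}
```

With a 75-second health check timeout and a 2-minute buffer, a slave disconnected for one minute would keep its tasks running, while one disconnected for five minutes would kill them.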

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
