[jira] [Created] (MESOS-525) Slave should kill tasks when disconnected from the master for longer than the health check timeout.

Benjamin Mahler (JIRA) Tue, 25 Jun 2013 10:19:48 -0700

Benjamin Mahler created MESOS-525:
-------------------------------------

             Summary: Slave should kill tasks when disconnected from the master 
for longer than the health check timeout.
                 Key: MESOS-525
                 URL: https://issues.apache.org/jira/browse/MESOS-525
             Project: Mesos
          Issue Type: Bug
            Reporter: Benjamin Mahler
            Assignee: Benjamin Mahler



The following scenario was observed in production at Twitter:

1. Task T begins running on a slave:
I0618 02:54:38.069694 15362 slave.cpp:830] Status update: task T of framework F 
is now in state TASK_RUNNING

2. Due to a network partition, the slave is removed from the master for failing 
health checks:
W0618 23:56:18.063217 28745 master.cpp:1172] Removing slave 
201304011727-2230002186-5050-28738-3217 at S:5051 because it has been 
deactivated
I0618 23:56:18.068821 28745 master.cpp:1181] Master now considering a slave at 
S:5051 as inactive

3. The task kept running on the partitioned slave for 6 days, until a user 
manually killed the process and the executor marked it as finished:
I0624 20:20:57.565053 15380 slave.cpp:830] Status update: task 
1371524058397-ads-adshard-production-153-a4504eb0-384b-4600-b6fe-e080c87bd84e 
of framework 201104070004-0000002563-0000 is now in state TASK_FINISHED

There are a few ways to fix this in the slave. The options below rely on the 
fact that the master will have marked the tasks as LOST when it removed the 
slave, after which point we don't want the tasks to continue running.

  1. Have the slave commit suicide once it has been disconnected from the 
master for (<health_check_failure_timeout> + buffer). This only works well 
when cgroups is in use, so that the next run of the slave cleans up properly, 
and it gets messier with slave recovery.

  2. A cleaner approach would be to have the slave kill all executors running 
under it. We most likely want to send TASK_LOST updates for the tasks, although 
this will mean duplicate updates unless the master handles these correctly. 
Alternatively, we can avoid sending any updates, but then we'll need to 
guarantee that the master has already sent them.
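
Both options hinge on the same timing rule: act only once the slave has been 
disconnected for at least (<health_check_failure_timeout> + buffer). A minimal 
sketch of that rule follows, written in Python for brevity rather than as 
slave code. The class DisconnectWatchdog, its methods, and the executor names 
are all hypothetical and are not part of Mesos; the timeout values in the 
usage note are placeholders, not the master's real settings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DisconnectWatchdog:
    """Hypothetical helper illustrating option 2; not a Mesos API."""

    health_check_timeout: float  # seconds; assumed to mirror the master's timeout
    buffer: float = 0.0          # extra slack before acting, in seconds
    executors: List[str] = field(default_factory=list)
    disconnected_at: Optional[float] = None

    def on_disconnected(self, now: float) -> None:
        # Remember only the first moment of disconnection, so repeated
        # notifications do not push the deadline back.
        if self.disconnected_at is None:
            self.disconnected_at = now

    def on_reconnected(self) -> None:
        # A successful reconnection cancels the pending kill.
        self.disconnected_at = None

    def tick(self, now: float) -> List[str]:
        """Return the executors to kill at time `now` (possibly none)."""
        if self.disconnected_at is None:
            return []
        elapsed = now - self.disconnected_at
        if elapsed < self.health_check_timeout + self.buffer:
            return []
        # Deadline passed: hand back every executor exactly once.
        doomed, self.executors = self.executors, []
        return doomed
```

With placeholder values of health_check_timeout=75.0 and buffer=15.0, a slave 
disconnected at t=10 gets an empty list from tick() until t=100, at which 
point tick() returns every executor once. Whether the slave then also sends 
TASK_LOST updates is the open question described in option 2 and is 
deliberately left out of the sketch.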

