[ https://issues.apache.org/jira/browse/MESOS-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neil Conway updated MESOS-4659:
-------------------------------
    Description: 
If a framework becomes disconnected from the master, its tasks are killed after 
waiting for {{failover_timeout}}.

However, if a master failover occurs but a framework never reconnects to the 
new master, we never kill any of the tasks associated with that framework. 
These tasks remain orphaned and presumably would need to be manually removed by 
the operator. Similarly, if a framework gets torn down or disconnects while it 
has running tasks on a partitioned agent, those tasks are not shut down when the 
agent reregisters.

We should consider whether to kill such orphaned tasks automatically, likely 
after waiting for some (framework-configurable?) timeout.
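
To make the proposal concrete, here is a rough sketch of how such an orphan timeout could be evaluated. It is only an illustration under stated assumptions, not Mesos code: {{Task}}, {{find_orphans_to_kill}}, {{first_seen_orphaned}} and {{orphan_timeout}} are hypothetical names, and the sketch assumes the master can tell which frameworks are currently registered and which tasks agents have reported.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Task:
    # Hypothetical, simplified stand-in for a task reported by an agent.
    task_id: str
    framework_id: str


def find_orphans_to_kill(
    tasks: List[Task],
    registered_frameworks: Set[str],
    first_seen_orphaned: Dict[str, float],
    now: float,
    orphan_timeout: float,
) -> List[str]:
    """Return IDs of tasks whose framework has been absent for at least
    orphan_timeout seconds (hypothetical setting, not a Mesos flag)."""
    to_kill = []
    for task in tasks:
        if task.framework_id in registered_frameworks:
            # The framework is known to this master: clear any orphan timer.
            first_seen_orphaned.pop(task.framework_id, None)
            continue
        # Start the timer the first time the framework is seen as missing.
        started = first_seen_orphaned.setdefault(task.framework_id, now)
        if now - started >= orphan_timeout:
            to_kill.append(task.task_id)
    return to_kill
```

The timer starts when the new master first observes a task whose framework is unknown, so a framework that reconnects within the timeout keeps its tasks, mirroring how {{failover_timeout}} behaves for an ordinary disconnect.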

  was:
If a framework becomes disconnected from the master, its tasks are killed after 
waiting for {{failover_timeout}}.

However, if a master failover occurs but a framework never reconnects to the 
new master, we never kill any of the tasks associated with that framework. 
These tasks remain orphaned and presumably would need to be manually removed by 
the operator.

We should consider whether to kill such orphaned tasks automatically, likely 
after waiting for some (framework-configurable?) timeout.


> Avoid leaving orphan task after framework failure + master failover
> -------------------------------------------------------------------
>
>                 Key: MESOS-4659
>                 URL: https://issues.apache.org/jira/browse/MESOS-4659
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Neil Conway
>              Labels: failover, mesosphere
>
> If a framework becomes disconnected from the master, its tasks are killed 
> after waiting for {{failover_timeout}}.
> However, if a master failover occurs but a framework never reconnects to the 
> new master, we never kill any of the tasks associated with that framework. 
> These tasks remain orphaned and presumably would need to be manually removed 
> by the operator. Similarly, if a framework gets torn down or disconnects 
> while it has running tasks on a partitioned agent, those tasks are not 
> shut down when the agent reregisters.
> We should consider whether to kill such orphaned tasks automatically, likely 
> after waiting for some (framework-configurable?) timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
