[ https://issues.apache.org/jira/browse/MESOS-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038509#comment-17038509 ]
Charles commented on MESOS-4659:
--------------------------------
[~vinodkone]
Is there any chance of implementing the workaround you suggested above: setting
up a timer upon framework recovery that automatically tears down the framework
after the failover timeout if it hasn't re-registered by then?
It would be really useful for us, as we sometimes run into a problem where the
agents keep spamming the master about finished tasks belonging to an
unregistered framework.
If you don't have the bandwidth to work on it, I could maybe work on a patch?
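To make the idea concrete, here is a rough sketch of the timer logic as a toy Python model. It is not Mesos code: the `Master` and `Framework` classes and the `recover_framework`, `reregister_framework`, `tick`, and `teardown` methods are all hypothetical names, and I'm assuming the new master learns a recovered framework's failover timeout at recovery time. Time is passed in explicitly instead of using a real timer, to keep the example deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class Framework:
    framework_id: str
    failover_timeout: float  # seconds; assumed to be known at recovery time
    tasks: set = field(default_factory=set)
    connected: bool = False


class Master:
    """Toy model of a newly elected master (hypothetical, not the Mesos API)."""

    def __init__(self):
        self.frameworks = {}  # framework_id -> Framework
        self.deadlines = {}   # framework_id -> time at which to tear it down
        self.killed = []      # task ids killed by teardown, for inspection

    def recover_framework(self, fw, now):
        # First time the new master hears about this framework after failover:
        # start the clock instead of waiting forever for it to come back.
        if fw.framework_id not in self.frameworks:
            self.frameworks[fw.framework_id] = fw
            self.deadlines[fw.framework_id] = now + fw.failover_timeout

    def reregister_framework(self, framework_id):
        # The framework came back in time, so cancel its pending teardown.
        self.frameworks[framework_id].connected = True
        self.deadlines.pop(framework_id, None)

    def tick(self, now):
        # Tear down every recovered framework whose deadline has passed.
        expired = [fid for fid, d in self.deadlines.items() if d <= now]
        for fid in expired:
            self.teardown(fid)

    def teardown(self, framework_id):
        fw = self.frameworks.pop(framework_id)
        self.deadlines.pop(framework_id, None)
        self.killed.extend(sorted(fw.tasks))


master = Master()
master.recover_framework(Framework("fw-a", 10.0, {"t1", "t2"}), now=0.0)
master.recover_framework(Framework("fw-b", 10.0, {"t3"}), now=0.0)
master.reregister_framework("fw-b")  # fw-b reconnects, fw-a never does
master.tick(now=10.0)                # fw-a's timeout has elapsed
print(master.killed)
```

In this model only the framework that never re-registers is torn down; the one that reconnects before its deadline keeps its tasks.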
Cheers,
> Avoid leaving orphan task after framework failure + master failover
> -------------------------------------------------------------------
>
> Key: MESOS-4659
> URL: https://issues.apache.org/jira/browse/MESOS-4659
> Project: Mesos
> Issue Type: Bug
> Components: master
> Reporter: Neil Conway
> Priority: Major
> Labels: failover, mesosphere
>
> If a framework becomes disconnected from the master, its tasks are killed
> after waiting for {{failover_timeout}}.
> However, if a master failover occurs but a framework never reconnects to the
> new master, we never kill any of the tasks associated with that framework.
> These tasks remain orphaned and presumably would need to be manually removed
> by the operator. Similarly, if a framework gets torn down or disconnects
> while it has running tasks on a partitioned agent, those tasks are not
> shut down when the agent reregisters.
> We should consider whether to kill such orphaned tasks automatically, likely
> after waiting for some (framework-configurable?) timeout.