[jira] [Commented] (MESOS-6136) Duplicate framework id handling

Neil Conway (JIRA) Thu, 08 Sep 2016 04:51:07 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15473653#comment-15473653
 ]


Neil Conway commented on MESOS-6136:
------------------------------------

We might want to distinguish between "framework has been explicitly torn down" 
(via the {{/teardown}} endpoint) and "framework has been disconnected for 
longer than {{failover_timeout}}". In the former case, the operator has 
explicitly removed the framework, so it seems quite reasonable for Mesos to 
kill the associated tasks (and we should arrange to do this even for tasks 
running on agents that are partitioned at the time of the {{/teardown}}). In 
the latter case, having Mesos kill tasks at any point is more debatable. 
Obviously the recommended practice for production frameworks is to set a high 
{{failover_timeout}}. We could perhaps change this behavior: e.g., deprecate 
{{failover_timeout}}, and say that tasks associated with disconnected 
frameworks continue running indefinitely until/unless killed by the operator. 
As part of this, we would probably want to provide better support for cleaning 
up the state associated with such a disconnected framework -- e.g., allowing 
{{/teardown}} to be used for this purpose.
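
To make the distinction concrete, here is a minimal sketch of the two removal paths as a policy function. This is not Mesos code: {{RemovalReason}}, {{Decision}}, {{decide}} and the {{kill_on_timeout}} switch are hypothetical names invented for illustration, and the sketch assumes that "current behavior" means tasks are killed once {{failover_timeout}} expires.

```python
from dataclasses import dataclass
from enum import Enum, auto


class RemovalReason(Enum):
    """Why the master is considering removing a framework (hypothetical)."""
    TEARDOWN = auto()          # operator explicitly called /teardown
    FAILOVER_TIMEOUT = auto()  # disconnected longer than failover_timeout


@dataclass(frozen=True)
class Decision:
    kill_tasks: bool              # should the framework's tasks be killed?
    remove_framework_state: bool  # should the master drop the framework's state?


def decide(reason: RemovalReason, kill_on_timeout: bool = False) -> Decision:
    """Return what to do with a framework's tasks and state.

    kill_on_timeout=True models the assumed current behavior; False models
    the proposal, where tasks of a disconnected framework keep running until
    an operator intervenes (e.g. via /teardown).
    """
    if reason is RemovalReason.TEARDOWN:
        # The operator removed the framework explicitly: kill and clean up.
        return Decision(kill_tasks=True, remove_framework_state=True)
    # Failover timeout: only kill and clean up if the old behavior is kept.
    return Decision(kill_tasks=kill_on_timeout,
                    remove_framework_state=kill_on_timeout)
```

Under the proposed policy, {{decide(RemovalReason.FAILOVER_TIMEOUT)}} leaves both tasks and state alone, and a later {{decide(RemovalReason.TEARDOWN)}} is what finally cleans up the disconnected framework.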

> Duplicate framework id handling
> -------------------------------
>
>                 Key: MESOS-6136
>                 URL: https://issues.apache.org/jira/browse/MESOS-6136
>             Project: Mesos
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 0.28.1
>         Environment: DCOS 1.7 Cloud Formation scripts
>            Reporter: Christopher Hunt
>            Priority: Critical
>              Labels: framework, lifecyclemanagement, task
>
> We have observed a situation where Mesos kills the tasks belonging to a 
> framework when that framework's connection to the Mesos master times out for 
> some reason, perhaps even because of a network partition.
> While we can set a long timeout so that, for practical purposes, Mesos will 
> not kill a framework's tasks, I'm wondering whether there is an improvement 
> where a framework still isn't permitted to re-register with a given id (as 
> now), but Mesos also doesn't kill its tasks. What I'm thinking is that Mesos 
> could be "told" by an operator that this condition should be cleared.
> IMHO frameworks should be the only entity requesting that tasks be killed 
> unless manually overridden by an operator.
> I'm flagging this as a critical improvement because a) the focus should be on 
> keeping tasks running in a system, and it isn't; and b) Mesos is working as 
> designed. 
> In summary I feel that Mesos is taking on a responsibility in killing tasks 
> where it shouldn't be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
