[jira] [Comment Edited] (MESOS-6136) Duplicate framework id handling

Christopher Hunt (JIRA) Wed, 07 Sep 2016 18:57:57 -0700

    [ https://issues.apache.org/jira/browse/MESOS-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472407#comment-15472407 ]


Christopher Hunt edited comment on MESOS-6136 at 9/8/16 1:56 AM:
-----------------------------------------------------------------

> So in a situation where a framework no longer exists...

Mesos can never be sure whether a framework still exists. For example, 
Mesos cannot determine whether the framework has stopped for some reason or 
whether it is merely cut off by a network partition.

By comparison, Akka does not automatically "down" a cluster member when it 
becomes lost. Instead, it quarantines the member, requiring an operator to 
intervene (there is also a product we provided for handling split-brain 
scenarios that will automatically down parts of the cluster, but I 
digress...).

I'm suggesting that Mesos likewise quarantine such frameworks rather than 
kill their tasks. Perhaps a framework could opt in to this behavior.
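
To make the suggestion concrete, here is a minimal sketch of the policy I have 
in mind. It is a toy model only, not Mesos code: the names Framework, 
quarantine_opt_in, on_failover_timeout, can_reregister and operator_clear are 
all hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class FrameworkState(Enum):
    ACTIVE = "active"
    QUARANTINED = "quarantined"  # unreachable, tasks left running
    REMOVED = "removed"          # unreachable, tasks killed


@dataclass
class Framework:
    framework_id: str
    quarantine_opt_in: bool  # hypothetical opt-in flag, not a real Mesos field
    tasks: set = field(default_factory=set)
    state: FrameworkState = FrameworkState.ACTIVE


def on_failover_timeout(fw: Framework) -> set:
    """Handle a framework that stayed unreachable past its timeout.

    Returns the set of task ids that were killed.
    """
    if fw.quarantine_opt_in:
        # Proposed behavior: quarantine the framework, leave tasks alone.
        fw.state = FrameworkState.QUARANTINED
        return set()
    # Behavior described in the issue: the tasks are killed.
    killed = set(fw.tasks)
    fw.tasks.clear()
    fw.state = FrameworkState.REMOVED
    return killed


def can_reregister(fw: Framework) -> bool:
    """Re-registration under the same id is refused unless the framework is active."""
    return fw.state is FrameworkState.ACTIVE


def operator_clear(fw: Framework) -> None:
    """An operator explicitly clears the quarantine condition."""
    if fw.state is FrameworkState.QUARANTINED:
        fw.state = FrameworkState.ACTIVE
```

In this sketch a quarantined framework keeps its tasks and stays locked out 
until operator_clear is called, which mirrors the "operator must intervene" 
model described above.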


was (Author: huntc):
> So in a situation where a framework no longer exists...

Mesos can never be sure on whether a framework exists or not. For example, 
Mesos cannot determine if the framework has stopped for some reason, or whether 
it is just a network partition.

> Duplicate framework id handling
> -------------------------------
>
>                 Key: MESOS-6136
>                 URL: https://issues.apache.org/jira/browse/MESOS-6136
>             Project: Mesos
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 0.28.1
>         Environment: DCOS 1.7 Cloud Formation scripts
>            Reporter: Christopher Hunt
>            Priority: Critical
>              Labels: framework, lifecyclemanagement, task
>
> We have observed a situation where Mesos will kill tasks belonging to a 
> framework when that framework times out with the Mesos master for some 
> reason, perhaps even because of a network partition.
> While we can provide a long timeout so that, for practical purposes, Mesos 
> will not kill a framework's tasks, I'm wondering whether there is room for 
> an improvement in which a framework is still not permitted to re-register 
> for a given id (as now), but Mesos does not also kill its tasks. What I'm 
> thinking is that Mesos could be "told" by an operator that this condition 
> should be cleared.
> IMHO frameworks should be the only entity requesting that tasks be killed 
> unless manually overridden by an operator.
> I'm flagging this as a critical improvement because a) the focus should be on 
> keeping tasks running in a system, and it isn't; and b) Mesos is working as 
> designed. 
> In summary I feel that Mesos is taking on a responsibility in killing tasks 
> where it shouldn't be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
