[ https://issues.apache.org/jira/browse/MESOS-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15483578#comment-15483578 ]
Neil Conway commented on MESOS-6136:
------------------------------------
Can you clarify which behavior you're referring to when you say that Mesos will
"prevent a framework from rejoining given the inconsistent state of tasks"?
In my mind, a framework ID basically identifies a "framework session". A
framework session is created when a framework registers for the first time (and
doesn't provide an ID). Current behavior:
* session continues until _either_ {{/teardown}} is used or the framework is
disconnected for longer than {{failover_timeout}}
* to resume a session from a new connection, you just specify the framework ID
when registering with the master.
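
To make the current rules concrete, here is a minimal sketch in Python. It is a toy model rather than Mesos code: the `Session` class and its method names are hypothetical, and only the `failover_timeout` setting, the teardown step, and the "re-register with the same framework ID" rule are taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    """Toy model of a framework session (hypothetical, not a Mesos API)."""

    framework_id: str
    failover_timeout: float                  # seconds tolerated after a disconnect
    disconnected_at: Optional[float] = None  # None while the framework is connected
    torn_down: bool = False

    def disconnect(self, now: float) -> None:
        """Record the time at which the framework lost its connection."""
        self.disconnected_at = now

    def teardown(self) -> None:
        """Model an explicit teardown: the session ends immediately."""
        self.torn_down = True

    def is_alive(self, now: float) -> bool:
        """Alive unless torn down or disconnected for longer than the timeout."""
        if self.torn_down:
            return False
        if self.disconnected_at is None:
            return True
        return now - self.disconnected_at <= self.failover_timeout

    def reregister(self, framework_id: str, now: float) -> bool:
        """Resume the session from a new connection by presenting the same ID."""
        if framework_id != self.framework_id or not self.is_alive(now):
            return False
        self.disconnected_at = None
        return True
```

In this model, a framework that reconnects within the timeout resumes its session, while one that reconnects afterwards finds the session gone.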
Proposed change in behavior:
* session continues indefinitely until explicit {{/teardown}}
* either support for {{failover_timeout}} is deprecated/removed, or
{{failover_timeout}} simply defaults to infinite; I'm not sure which.
* we look at enhancing the usability of {{/teardown}} or making it easier to
identify/terminate tasks associated with orphan framework IDs, as needed.
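
The proposed rules can be sketched the same way. This is a rough illustration under stated assumptions, not an implementation: `session_alive`, `PROPOSED_DEFAULT_TIMEOUT`, and `build_teardown_request` are hypothetical names, and the request builder assumes the master exposes the endpoint at `/master/teardown` and accepts a form-encoded `frameworkId` in a POST body. The request is only constructed here, never sent.

```python
import math
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: an "infinite" failover timeout is modeled as math.inf seconds.
PROPOSED_DEFAULT_TIMEOUT = math.inf


def session_alive(disconnected_for: float,
                  failover_timeout: float,
                  torn_down: bool) -> bool:
    """Return True while the session should be kept.

    With an infinite timeout, only an explicit teardown ends the session.
    """
    if torn_down:
        return False
    return disconnected_for <= failover_timeout


def build_teardown_request(master: str, framework_id: str) -> Request:
    """Build (but do not send) a teardown request for one framework ID.

    Assumption: `master` is a host:port string and the endpoint path is
    /master/teardown; adjust both for a real cluster.
    """
    body = urlencode({"frameworkId": framework_id}).encode()
    return Request(f"http://{master}/master/teardown", data=body, method="POST")
```

An operator-facing tool for orphan framework IDs could wrap something like `build_teardown_request` and send it with `urllib.request.urlopen`, once the orphaned IDs have been identified.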
> Duplicate framework id handling
> -------------------------------
>
> Key: MESOS-6136
> URL: https://issues.apache.org/jira/browse/MESOS-6136
> Project: Mesos
> Issue Type: Improvement
> Components: general
> Affects Versions: 0.28.1
> Environment: DCOS 1.7 Cloud Formation scripts
> Reporter: Christopher Hunt
> Priority: Critical
> Labels: framework, lifecyclemanagement, task
>
> We have observed a situation where Mesos kills tasks belonging to a
> framework when that framework times out with the Mesos master for some
> reason, perhaps even because of a network partition.
> While we can provide a long timeout so that, for practical purposes, Mesos
> will not kill a framework's tasks, I'm wondering whether there's an
> improvement where a framework still isn't permitted to re-register for a
> given ID (as now), but Mesos also doesn't kill its tasks. What I'm thinking
> is that an operator could "tell" Mesos that this condition should be cleared.
> IMHO frameworks should be the only entity requesting that tasks be killed
> unless manually overridden by an operator.
> I'm flagging this as a critical improvement because a) the focus should be on
> keeping tasks running in a system, and it isn't; and b) Mesos is working as
> designed.
> In summary, I feel that Mesos is taking on a responsibility for killing
> tasks that it shouldn't have.