Christopher Hunt created MESOS-6136:
---------------------------------------

             Summary: Duplicate framework id handling
                 Key: MESOS-6136
                 URL: https://issues.apache.org/jira/browse/MESOS-6136
             Project: Mesos
          Issue Type: Improvement
          Components: general
    Affects Versions: 0.28.1
         Environment: DCOS 1.7 Cloud Formation scripts
            Reporter: Christopher Hunt
            Priority: Critical


We have observed that Mesos kills the tasks belonging to a framework when that 
framework times out with the Mesos master for some reason, perhaps even 
because of a network partition.

While we can configure a timeout long enough that, for practical purposes, 
Mesos will not kill a framework's tasks, I'm wondering whether there is room 
for an improvement: a framework would still not be permitted to re-register 
for a given id (as now), but Mesos would not also kill its tasks. What I'm 
thinking is that Mesos could be "told" by an operator that this condition 
should be cleared.
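
For reference, a minimal sketch of the long-timeout workaround mentioned above. 
It assumes the timeout in question is the failover_timeout field of 
FrameworkInfo (in seconds) and that the framework subscribes through the v1 
scheduler HTTP API. The helper name, the framework name and user, and the 
one-week value are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical value for illustration only, not a recommendation.
ONE_WEEK_SECONDS = 7 * 24 * 60 * 60


def build_subscribe_call(name, user, failover_timeout_seconds, framework_id=None):
    """Build the JSON body of a SUBSCRIBE call (hypothetical helper)."""
    framework_info = {
        "name": name,
        "user": user,
        # How long the master waits for the scheduler to reconnect before
        # tearing the framework down and killing its tasks.
        "failover_timeout": float(failover_timeout_seconds),
    }
    call = {"type": "SUBSCRIBE", "subscribe": {"framework_info": framework_info}}
    if framework_id is not None:
        # When re-subscribing under an existing id, the id is set both on
        # the call and inside FrameworkInfo.
        framework_info["id"] = {"value": framework_id}
        call["framework_id"] = {"value": framework_id}
    return call


body = json.dumps(build_subscribe_call("example-framework", "root", ONE_WEEK_SECONDS))
```

This only postpones the teardown. Once the timeout expires, the tasks are 
still killed, which is the behaviour this issue is about.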

IMHO frameworks should be the only entities requesting that tasks be killed, 
unless manually overridden by an operator.

I'm flagging this as critical because the focus should be on keeping tasks 
running in a system, and currently it isn't; and as an improvement rather than 
a bug because Mesos is working as designed.

In summary, I feel that Mesos is taking on a responsibility for killing tasks 
that it shouldn't have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
