Gabriel Hartmann created MESOS-4306:
---------------------------------------

             Summary: AGENT_DEAD Message
                 Key: MESOS-4306
                 URL: https://issues.apache.org/jira/browse/MESOS-4306
             Project: Mesos
          Issue Type: Task
            Reporter: Gabriel Hartmann


Frameworks currently receive SLAVE_LOST messages when an Agent fails or is
behind a network partition for some period of time.  However, neither
frameworks nor Mesos itself can tell whether an Agent is temporarily or
permanently lost.

It would be good to have a message indicating that an Agent is lost and won't
be returning.  Determining this requires human intervention, so an endpoint
should be exposed that lets an operator trigger the sending of this message.

This is particularly helpful for frameworks which are waiting for the return of
persistent volumes.  When an Agent hosting significant data (multiple
terabytes) is lost, the framework may be willing to wait a significant amount
of time before repairing its replication factor, for example.  Explicit,
human-provided information about the permanent state of Agents, and therefore
of their resources, would allow these kinds of frameworks to accelerate their
recovery timelines.
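
As a rough illustration of the framework-side logic described above, here is a
minimal Python sketch.  It is not Mesos API code: the ReplicaTracker class, its
method names, and the grace period are all hypothetical, and AGENT_DEAD is the
message proposed in this ticket, which does not exist yet.  The sketch assumes
the framework waits out a grace period after SLAVE_LOST but repairs
immediately once an operator has declared the Agent dead.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ReplicaTracker:
    """Hypothetical framework-side bookkeeping for lost Agents."""

    # How long the framework is willing to wait for a lost Agent to return.
    grace_period_s: float
    # Agent ID -> time at which SLAVE_LOST was first received.
    lost_since: Dict[str, float] = field(default_factory=dict)
    # Agents an operator has declared permanently gone (proposed AGENT_DEAD).
    dead: Set[str] = field(default_factory=set)

    def on_slave_lost(self, agent_id: str, now: float) -> None:
        # Keep the earliest loss time if the message is delivered twice.
        self.lost_since.setdefault(agent_id, now)

    def on_agent_returned(self, agent_id: str) -> None:
        # The loss was temporary; stop the clock.
        self.lost_since.pop(agent_id, None)

    def on_agent_dead(self, agent_id: str) -> None:
        # Assumption: AGENT_DEAD is final, so the Agent is never expected back.
        self.dead.add(agent_id)

    def should_repair(self, agent_id: str, now: float) -> bool:
        # Repair at once for a dead Agent, otherwise only after the grace period.
        if agent_id in self.dead:
            return True
        lost_at = self.lost_since.get(agent_id)
        return lost_at is not None and now - lost_at >= self.grace_period_s
```

With a one-hour grace period, an Agent lost at t=0 would not trigger repair at
t=60 on SLAVE_LOST alone, but would as soon as on_agent_dead is called for it.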



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)