Gabriel Hartmann created MESOS-4306:
---------------------------------------
Summary: AGENT_DEAD Message
Key: MESOS-4306
URL: https://issues.apache.org/jira/browse/MESOS-4306
Project: Mesos
Issue Type: Task
Reporter: Gabriel Hartmann
Frameworks currently receive SLAVE_LOST messages when an Agent fails or is
behind a network partition for some period of time. However, frameworks, and
indeed Mesos itself, cannot differentiate between an Agent that is temporarily
lost and one that is permanently lost.
It would be good to have a message indicating that an Agent is lost and will
not be returning. Since only a human can make that determination, an endpoint
should be exposed that an operator can call to trigger this message.
This is particularly helpful for frameworks that are waiting for the return of
persistent volumes. When an Agent hosts a significant amount of data (multiple
terabytes), a framework may be willing to wait a long time before, for example,
repairing its replication factor. Explicit, human-provided information about
the permanent state of Agents, and therefore of their resources, would allow
these kinds of frameworks to accelerate their recovery timelines.
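The recovery trade-off described above might look like this on the framework side. This is a minimal sketch, assuming the framework records when SLAVE_LOST arrived and flips a flag on a hypothetical AGENT_DEAD message; `LostVolume`, `should_rereplicate` and the `wait_timeout` grace period are illustrative names chosen by the framework, not a Mesos API.

```python
from dataclasses import dataclass


@dataclass
class LostVolume:
    agent_id: str
    lost_at: float                # seconds, when SLAVE_LOST arrived
    confirmed_dead: bool = False  # set when the hypothetical AGENT_DEAD arrives


def should_rereplicate(volume: LostVolume, now: float, wait_timeout: float) -> bool:
    """Decide whether to rebuild the lost replica on another agent.

    wait_timeout is a hypothetical, framework-chosen grace period. For a
    multi-terabyte volume it may be hours, since copying the data again is
    expensive and the agent might still return.
    """
    if volume.confirmed_dead:
        # An operator said the agent will never return: waiting gains nothing.
        return True
    # Otherwise only give up on the agent once the grace period has elapsed.
    return now - volume.lost_at >= wait_timeout


GRACE_PERIOD = 6 * 3600.0  # hypothetical: six hours
vol = LostVolume(agent_id="agent-1", lost_at=0.0)
print(should_rereplicate(vol, now=600.0, wait_timeout=GRACE_PERIOD))  # False
vol.confirmed_dead = True  # AGENT_DEAD received
print(should_rereplicate(vol, now=600.0, wait_timeout=GRACE_PERIOD))  # True
```

Without an AGENT_DEAD signal, the only lever is the grace period, so a framework protecting large volumes must pick a long one and pay for it on every genuine loss; the explicit message lets it skip the wait in exactly those cases.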
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)