[ 
https://issues.apache.org/jira/browse/MESOS-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088014#comment-15088014
 ] 

Joseph Wu commented on MESOS-4306:
----------------------------------

I don't think you need another message in this case.

With maintenance (in 0.25), an operator can set an unavailability period of 
infinity to denote the same semantics as {{AGENT_DEAD}} (or rather, 
{{AGENT_TO_BE_KILLED}}?).  The framework would be notified of this in advance 
via inverse offers.
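
As a rough sketch of what the operator would submit: the snippet below builds a maintenance schedule whose window has a start time but no duration. The field names ({{windows}}, {{machine_ids}}, {{unavailability}}, {{nanoseconds}}) are how I recall the maintenance schedule JSON, and my understanding is that leaving out {{duration}} means the unavailability is open-ended, so check both against the 0.25 maintenance docs. The hostname, IP, timestamp and the helper names are made up for illustration.

```python
import json
import math
from typing import Iterable, Optional

NANOS_PER_SECOND = 1_000_000_000


def maintenance_window(hostname: str, ip: str, start_seconds: int,
                       duration_seconds: Optional[float] = None) -> dict:
    """Build one maintenance window (hypothetical helper).

    Assumption: omitting `duration` denotes an open-ended (infinite)
    unavailability, so None or infinity both leave the field out.
    """
    unavailability = {
        "start": {"nanoseconds": start_seconds * NANOS_PER_SECOND},
    }
    if duration_seconds is not None and math.isfinite(duration_seconds):
        unavailability["duration"] = {
            "nanoseconds": int(duration_seconds * NANOS_PER_SECOND),
        }
    return {
        "machine_ids": [{"hostname": hostname, "ip": ip}],
        "unavailability": unavailability,
    }


def maintenance_schedule(windows: Iterable[dict]) -> dict:
    """Wrap the windows in the top-level schedule object."""
    return {"windows": list(windows)}


# Made-up agent that is going away for good, starting at a made-up time.
payload = maintenance_schedule([
    maintenance_window("agent-17.example.com", "10.0.0.17", 1452211200),
])
print(json.dumps(payload, indent=2))
```

The resulting JSON is what would be POSTed to the master's maintenance schedule endpoint; the snippet deliberately stops short of making the HTTP call.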

When the agent actually gets terminated (by the operator), the framework will 
see a {{SLAVE_LOST}} (in HTTP API-land, {{Event::FAILURE}}).
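
On the framework side, the two signals could be combined roughly as follows. This is only a sketch of the bookkeeping, not Mesos API: {{AgentTracker}}, its method names and the grace period are all hypothetical, and the callbacks stand in for whatever the scheduler driver or HTTP API actually delivers.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class AgentTracker:
    """Hypothetical framework-side bookkeeping for agent loss."""

    # How long to wait for a lost agent (and its persistent volumes)
    # to come back when nothing says the loss is permanent.
    grace_seconds: float
    # Agents announced as going away with no end to the unavailability.
    doomed: Set[str] = field(default_factory=set)

    def on_inverse_offer(self, agent_id: str,
                         duration_seconds: Optional[float]) -> None:
        # Assumption: a missing duration means the unavailability is infinite.
        if duration_seconds is None:
            self.doomed.add(agent_id)
        else:
            self.doomed.discard(agent_id)

    def on_agent_failure(self, agent_id: str) -> float:
        """Return how many seconds to wait before repairing replication."""
        if agent_id in self.doomed:
            return 0.0  # Operator already said it is not coming back.
        return self.grace_seconds


tracker = AgentTracker(grace_seconds=3600.0)
tracker.on_inverse_offer("agent-17", duration_seconds=None)
print(tracker.on_agent_failure("agent-17"))  # permanent: repair immediately
print(tracker.on_agent_failure("agent-42"))  # unknown: wait out the grace period
```

An agent that disappears without a prior inverse offer still falls into the "wait" branch here.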

Would it help to add maintenance info to {{Event::FAILURE}} too, to cover the 
case where a machine is taken down before any inverse offers get sent?

> AGENT_DEAD Message
> ------------------
>
>                 Key: MESOS-4306
>                 URL: https://issues.apache.org/jira/browse/MESOS-4306
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Gabriel Hartmann
>
> Frameworks currently receive SLAVE_LOST messages when an Agent fails or is 
> behind a network partition for some period of time.  However frameworks and 
> indeed Mesos cannot differentiate between an Agent being temporarily or 
> permanently lost.
> It would be good to have a message indicating that an Agent is lost and won't 
> be returning.  This would require human intervention so an endpoint should be 
> exposed to induce the sending of this message.
> This is particularly helpful for frameworks which are waiting for the return 
> of persistent volumes.  When an Agent hosting significant data (multiple 
> terabytes) is lost, the framework may be willing to wait a significant amount 
> of time before repairing its replication factor (for example).  Explicit, 
> human-provided information about the permanent state of Agents and therefore their 
> resources would allow these kinds of frameworks to accelerate their recovery 
> timelines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)