[ 
https://issues.apache.org/jira/browse/MESOS-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606261#comment-15606261
 ] 

Neil Conway commented on MESOS-6078:
------------------------------------

FYI, we will likely address this as part of the in-progress work on supporting 
{{TASK_GONE}} and {{TASK_GONE_BY_OPERATOR}}. Workflow:

* framework opts in to the {{PARTITION_AWARE}} capability.
* if Mesos can _prove_ that the agent ID is gone (e.g., because the agent 
reboots, changes its boot ID, and then an agent using the same {{work_dir}} 
registers and receives a new agent ID), the framework will get {{TASK_GONE}} 
status updates for all tasks on the agent.
* if the operator has some out-of-band knowledge that the agent will never 
attempt to re-register and all of its tasks are no longer running, we'll 
provide an operator HTTP endpoint (e.g., /agent/gone) that the operator can 
hit. When this happens, the framework will receive {{TASK_GONE_BY_OPERATOR}} 
status updates for all tasks on the agent.
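
On the framework side, the workflow above might be handled roughly as follows. This is only a sketch: the two task state names come from the proposal, but the shape of a status update (a dict with {{task_id}} and {{state}} keys) and the helper name are assumptions made for illustration, not the actual scheduler API.

```python
# Task states from the proposed workflow; both mean the task will not come back.
GONE_STATES = {"TASK_GONE", "TASK_GONE_BY_OPERATOR"}


def tasks_to_reschedule(status_updates):
    """Return the IDs of tasks reported as gone, in the order received.

    `status_updates` is a hypothetical list of dicts with "task_id" and
    "state" keys; a real framework would read these from its scheduler
    driver or HTTP API events instead.
    """
    return [
        update["task_id"]
        for update in status_updates
        if update["state"] in GONE_STATES
    ]


if __name__ == "__main__":
    updates = [
        {"task_id": "web-1", "state": "TASK_RUNNING"},
        {"task_id": "web-2", "state": "TASK_GONE"},
        {"task_id": "db-1", "state": "TASK_GONE_BY_OPERATOR"},
    ]
    print(tasks_to_reschedule(updates))
```

A framework would presumably relaunch the returned tasks right away, since in both cases the tasks are known not to be running anymore.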

In the meantime, the {{/machine/down}} endpoint might help here -- it shouldn't 
be subject to the agent removal rate limit.
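
For reference, a call to {{/machine/down}} could be prepared roughly like this. Treat it as a sketch under assumptions: the JSON body (a list of objects with {{hostname}} and {{ip}}) is my recollection of the maintenance primitives documentation, the master URL and machine values are made up, and as far as I recall the machine has to be in the maintenance schedule before it can be marked down. Please check the docs for your Mesos version before relying on it.

```python
import json
import urllib.request


def build_machine_down_request(master_url, machines):
    """Build (but do not send) a POST request to the master's /machine/down.

    `machines` is a list of dicts such as {"hostname": ..., "ip": ...};
    this payload shape is an assumption, see the note above.
    """
    body = json.dumps(machines).encode("utf-8")
    return urllib.request.Request(
        url=master_url.rstrip("/") + "/machine/down",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical master address and machine, for illustration only.
    request = build_machine_down_request(
        "http://master.example.com:5050",
        [{"hostname": "agent-17.example.com", "ip": "10.0.0.17"}],
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```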

> Add a agent teardown endpoint
> -----------------------------
>
>                 Key: MESOS-6078
>                 URL: https://issues.apache.org/jira/browse/MESOS-6078
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 1.0.0, 1.0.1
>            Reporter: Cody Maloney
>            Assignee: Michael Park
>              Labels: mesosphere
>
> Currently, when a whole agent machine is unexpectedly terminated for good 
> (AWS terminated the instance without warning), it goes through the Mesos 
> slave removal rate limit before it's gone.
> If a couple of agents / a whole rack goes down in a cluster of thousands of 
> agents, this can get to be a problem.
> If the agent can be shut down "cleanly", everything would get scheduled, but 
> once the agent is gone, there is currently no good way for an administrator 
> to indicate that the node is gone and its tasks are lost / should be 
> rescheduled if appropriate as soon as possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
