[ 
https://issues.apache.org/jira/browse/MESOS-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941770#comment-14941770
 ] 

Megha commented on MESOS-3545:
------------------------------

I have spent some time looking into how to solve a subset of the problem, i.e.,
keeping tasks/executors running and in a recoverable state for an extended
period when a slave disconnects and then reconnects in due time, for example
because of a temporary network blip. I will be sending out a design soon.


> Investigate restoring tasks/executors after machine reboot.
> -----------------------------------------------------------
>
>                 Key: MESOS-3545
>                 URL: https://issues.apache.org/jira/browse/MESOS-3545
>             Project: Mesos
>          Issue Type: Improvement
>          Components: slave
>            Reporter: Benjamin Hindman
>              Labels: mesosphere
>
> If a task/executor is restartable (see MESOS-3544), it might make sense to
> force an agent to restart these tasks/executors _before_ re-registering after
> a machine reboot, in the event that the machine is network partitioned away
> from the master (or the master has failed) but we'd like to get these
> services running again. Assuming the agent(s) running on the machine has not
> been disconnected
> from the master for longer than the master's agent re-registration timeout 
> the agent should be able to re-register (i.e., after a network partition is 
> resolved) without a problem. However, in the same way that a framework would
> be interested in knowing that its tasks/executors were restarted, we'd want
> to send something like a TASK_RESTARTED status update.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
