[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553432#comment-15553432
 ] 

Yan Xu edited comment on MESOS-6223 at 10/6/16 10:48 PM:
---------------------------------------------------------

[~neilc] [~vinodkone] I can think of ways to implement restarting tasks 
post-reboot (MESOS-3545, design doc coming soon) via either the approach in 
this ticket or the one in MESOS-5368, but this one feels simpler. Treating 
reboot as a special case sounds to me like an optimization that will no longer 
hold once tasks are being restarted. So the questions are:

1) Should the agent ID *always* change after a reboot?
2) Does the agent ID *ever have to* change when its {{work_dir}} hasn't changed?

1) Sounds like no.

For 2), on the master the only error case where we disallow an agent from 
re-registering but do allow it to register is [when the agent's IP or 
hostname has 
changed|https://github.com/apache/mesos/blob/3902b051f2cff59c55535dae08ebd4223833b0a0/src/master/master.cpp#L5228]
 (a hostname change already prevents the agent from restarting). I can imagine 
we'd want to force the agent to discard its {{work_dir/<slaves>/slave_id}} 
but keep the checkpointed resources, etc.
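
As a rough illustration of the rule described above, here is a minimal sketch. It is not the actual logic in {{master.cpp}}; the names {{AgentEndpoint}}, {{may_reregister}} and {{resolve_agent_id}} are hypothetical, and the model assumes the IP/hostname check is the only thing that forces a new agent ID. Note that a reboot is deliberately not an input: under the proposal in this ticket, a reboot alone would not change the outcome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentEndpoint:
    """Hypothetical, simplified view of what the master records about an agent."""
    ip: str
    hostname: str


def may_reregister(recorded: AgentEndpoint, reported: AgentEndpoint) -> bool:
    # Assumption: re-registration is refused only when the IP or the
    # hostname differs from what the master has on record.
    return recorded.ip == reported.ip and recorded.hostname == reported.hostname


def resolve_agent_id(current_id: str,
                     recorded: AgentEndpoint,
                     reported: AgentEndpoint,
                     fresh_id: str) -> str:
    # Keep the existing ID when re-registration is allowed; otherwise the
    # agent would have to register as new and receive a fresh ID.
    if may_reregister(recorded, reported):
        return current_id
    return fresh_id
```

In this model, an agent that comes back after a reboot with the same IP and hostname keeps its ID, while an agent whose IP changed would drop its old ID and register as new.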

To summarize, it seems like we can keep both this ticket and MESOS-5368, but 
change MESOS-5368 so that it does not change the session ID in the reboot case?

Thoughts?


> Allow agents to re-register post a host reboot
> ----------------------------------------------
>
>                 Key: MESOS-6223
>                 URL: https://issues.apache.org/jira/browse/MESOS-6223
>             Project: Mesos
>          Issue Type: Improvement
>          Components: slave
>            Reporter: Megha
>
> The agent doesn't recover its state after a host reboot; it registers with 
> the master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The executors 
> are terminated on the agent when it reboots anyway, so there is no harm in 
> letting the agent keep its SlaveID, re-register with the master, and reconcile 
> the lost executors. This is a prerequisite for supporting 
> persistent/restartable tasks in Mesos (MESOS-3545).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
