[
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553432#comment-15553432
]
Yan Xu edited comment on MESOS-6223 at 10/6/16 10:48 PM:
---------------------------------------------------------
[~neilc] [~vinodkone] I can think of ways we can implement restarting tasks
post-reboot (MESOS-3545, will have design doc out soon) via either the approach
in this ticket or the one in MESOS-5368, but this one feels simpler. Treating
reboot as a special case sounds to me like an optimization that will no longer
hold true once tasks are being restarted. Then the questions are:
1) Should the agent ID *always* change after a reboot?
2) Does the agent ID *ever have to* change when its {{work_dir}} hasn't changed?
For 1), it sounds like no.
For 2), on the master the only error case where we disallow an agent from
re-registering but do allow it to register is [when the agent's IP or
hostname has
changed|https://github.com/apache/mesos/blob/3902b051f2cff59c55535dae08ebd4223833b0a0/src/master/master.cpp#L5228]
(a hostname change already prevents the agent from restarting). I can imagine
we'd want to force the agent to get rid of its {{work_dir/<slaves>/slave_id}}
but keep the checkpointed resources etc.?
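A rough sketch of the distinction in 2), only to make the two outcomes concrete. This is not the logic in {{master.cpp}}; {{AgentInfo}}, {{can_reregister}} and {{next_step}} are hypothetical names, and the only rule modeled is the one described above (IP or hostname change blocks re-registration but not registration).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentInfo:
    # Hypothetical stand-in for what an agent presents to the master.
    agent_id: str
    ip: str
    hostname: str


def can_reregister(previous: AgentInfo, current: AgentInfo) -> bool:
    """Assumed rule: re-registration requires an unchanged IP and hostname."""
    return previous.ip == current.ip and previous.hostname == current.hostname


def next_step(previous: AgentInfo, current: AgentInfo) -> str:
    """Decide what an agent with an unchanged work_dir would do in this sketch."""
    if can_reregister(previous, current):
        # Same IP and hostname: the agent keeps its ID.
        return "re-register"
    # IP or hostname changed: the agent drops its checkpointed ID
    # (but could keep other checkpointed state) and registers afresh.
    return "register-with-new-id"


before = AgentInfo("S1", "10.0.0.1", "host-a")
after_reboot = AgentInfo("S1", "10.0.0.1", "host-a")
after_ip_change = AgentInfo("S1", "10.0.0.2", "host-a")

print(next_step(before, after_reboot))
print(next_step(before, after_ip_change))
```

In this sketch a reboot by itself never forces a new ID; only the IP/hostname mismatch does.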
To summarize, it seems like we can keep both this ticket and MESOS-5368, but
change MESOS-5368 to not change the session ID in the reboot case?
Thoughts?
> Allow agents to re-register post a host reboot
> ----------------------------------------------
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
> Issue Type: Improvement
> Components: slave
> Reporter: Megha
>
> An agent doesn't recover its state after a host reboot; it registers with the
> master and gets a new SlaveID. With partition awareness, agents are now
> allowed to re-register after they have been marked Unreachable. The executors
> are terminated on the agent when it reboots anyway, so there is no harm in
> letting the agent keep its SlaveID, re-register with the master and reconcile
> the lost executors. This is a prerequisite for supporting
> persistent/restartable tasks in Mesos (MESOS-3545).