[
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085192#comment-16085192
]
Yan Xu commented on MESOS-6223:
-------------------------------
{noformat}
commit 188109b63ea9cc0cdfe1fd616c744cb10dbb4a57
Author: Megha Sharma <[email protected]>
Date: Wed Jul 12 22:03:37 2017 -0700
Added tests to ensure slave recovery post reboot.
Added tests to verify that the state is recovered post reboot and the
agent ID is retained when the recovery finishes without errors, and
that the agent is recovered as a new agent when the recovery fails due
to an agent info mismatch.
Review: https://reviews.apache.org/r/56895/
commit cd6495e677ec74fd3f40b0dbf3b9654475308575
Author: Megha Sharma <[email protected]>
Date: Mon Jul 10 09:38:28 2017 -0700
Recover as a new agent in case of agent info mismatch on reboot.
This is for backwards compatibility. Prior to Mesos 1.4 we directly
bypass the state recovery and start as a new agent upon reboot
(introduced in MESOS-844). This unnecessarily discards the existing
agent ID (MESOS-6223). Starting in Mesos 1.4 we'll attempt to recover
the slave state even after reboot but in case of slave info mismatch
we'll fall back to recovering as a new agent (existing behavior). This
prevents the agent from flapping if the agent info (resources,
attributes, etc.) change is due to host maintenance associated with
the reboot.
Review: https://reviews.apache.org/r/60105/
commit 91f4e9acd0bad60201155b68a896d12d7200eda3
Author: Megha Sharma <[email protected]>
Date: Mon Jul 10 09:34:40 2017 -0700
Stopped short-circuiting agent recovery upon reboot.
The agent would continue the recovery and we added a `rebooted` flag
to `slave::State` to record the reboot info.
Review: https://reviews.apache.org/r/60104/
{noformat}
Still need a patch for CHANGELOG and upgrades.md before resolving.
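For readers skimming the commits above, the post-reboot decision they describe can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual Mesos code: `AgentInfo`, `CheckpointedState`, `Outcome`, and `recover` are hypothetical names, only the `rebooted` flag comes from the commit messages, and treating a mismatch without a reboot as a failed recovery is an assumption. It needs C++17 for `std::optional`.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical, simplified stand-in for the agent info that is
// compared during recovery (resources, attributes, etc.).
struct AgentInfo {
  std::string hostname;
  std::string resources;
  std::string attributes;

  bool operator==(const AgentInfo& other) const {
    return hostname == other.hostname &&
           resources == other.resources &&
           attributes == other.attributes;
  }
};

// Hypothetical stand-in for the checkpointed state. The `rebooted`
// flag mirrors the one the commits add to `slave::State`.
struct CheckpointedState {
  std::string agentId;
  AgentInfo info;
  bool rebooted;
};

enum class Outcome { RECOVERED, NEW_AGENT, FAILED };

struct Result {
  Outcome outcome;
  std::optional<std::string> agentId;  // Set only when the ID is retained.
};

// Decide how the agent comes back up, given what was checkpointed
// and the agent info it is starting with now.
Result recover(
    const std::optional<CheckpointedState>& state,
    const AgentInfo& current) {
  if (!state) {
    // Nothing checkpointed: start as a new agent.
    return {Outcome::NEW_AGENT, std::nullopt};
  }

  if (state->info == current) {
    // Info matches: keep the agent ID, even after a reboot.
    return {Outcome::RECOVERED, state->agentId};
  }

  if (state->rebooted) {
    // Mismatch after a reboot: fall back to a new agent instead of
    // failing, so host maintenance does not cause flapping.
    return {Outcome::NEW_AGENT, std::nullopt};
  }

  // Assumption: a mismatch without a reboot is a recovery error.
  return {Outcome::FAILED, std::nullopt};
}
```

The point of the sketch is the ordering: the reboot no longer short-circuits recovery, and it only matters once an agent info mismatch has been detected.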
> Allow agents to re-register post a host reboot
> ----------------------------------------------
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
> Issue Type: Improvement
> Components: agent
> Reporter: Megha Sharma
> Assignee: Megha Sharma
>
> The agent doesn't recover its state after a host reboot; it registers with the
> master and gets a new SlaveID. With partition awareness, the agents are now
> allowed to re-register after they have been marked Unreachable. The executors
> are anyway terminated on the agent when it reboots so there is no harm in
> letting the agent keep its SlaveID, re-register with the master and reconcile
> the lost executors. This is a prerequisite for supporting
> persistent/restartable tasks in Mesos (MESOS-3545).