[
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870821#comment-15870821
]
Yan Xu commented on MESOS-6223:
-------------------------------
From my comment on the email thread:
{quote}
So one thing that was brought up during offline conversations was that if the
host reboot is associated with hardware change (e.g., a new memory stick):
Currently: the agent would skip the recovery (and the chance of running into
incompatible agent info) and register as a new agent.
With the change: the agent could run into incompatible agent info due to
resource change and flap indefinitely until the operator intervenes.
To mitigate this and maintain the current behavior, we can have the agent
run `rm -f <work_dir>/meta/slaves/latest` automatically upon recovery
failure but only after the host has rebooted. This way the agent can restart as
a new agent without operator intervention.
{quote}
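
The mitigation quoted above can be sketched roughly as follows. This is only an illustration of the decision logic, not the agent's actual implementation: the function name `reset_agent_if_rebooted` and its parameters are hypothetical, and it assumes a reboot is detected by comparing a boot ID saved before the reboot with the current one.

```python
import os
import tempfile


def reset_agent_if_rebooted(work_dir, recovery_failed, stored_boot_id, current_boot_id):
    """Hypothetical helper: drop the `latest` symlink only when recovery
    failed AND the host has rebooted since the agent last checkpointed.

    Returns True when the agent should restart as a new agent.
    """
    # Assumption: a differing boot ID means the host has rebooted.
    rebooted = stored_boot_id != current_boot_id
    if not (recovery_failed and rebooted):
        # No reboot, or recovery succeeded: leave the checkpointed state
        # alone so the operator can still inspect it.
        return False

    latest = os.path.join(work_dir, "meta", "slaves", "latest")
    try:
        # Removes the symlink itself, not the directory it points to.
        os.remove(latest)
    except FileNotFoundError:
        # Mirror `rm -f`: a missing file is not an error.
        pass
    return True
```

With this shape, a recovery failure without a reboot still stops the agent for operator attention, while a failure after a reboot clears `latest` so the agent comes back as a new agent, which matches the current behavior.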
Of course, even if we do this to maintain the current behavior, it remains true
that relying on a reboot as a signal for hardware change is unreliable; the
proper fix for that should be MESOS-1739.
> Allow agents to re-register post a host reboot
> ----------------------------------------------
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
> Issue Type: Improvement
> Components: agent
> Reporter: Megha Sharma
> Assignee: Megha Sharma
>
> The agent doesn't recover its state after a host reboot; it registers with the
> master and gets a new SlaveID. With partition awareness, agents are now
> allowed to re-register after they have been marked Unreachable. The executors
> are terminated on the agent when it reboots anyway, so there is no harm in
> letting the agent keep its SlaveID, re-register with the master, and reconcile
> the lost executors. This is a prerequisite for supporting
> persistent/restartable tasks in Mesos (MESOS-3545).