One thing that came up in offline conversations is what happens if the host reboot is associated with a hardware change (e.g., a new memory stick):
- Currently: the agent skips the recovery (and with it the chance of running into incompatible agent info) and registers as a new agent.
- With the change: the agent could run into incompatible agent info due to the resource change and flap <https://github.com/apache/mesos/blob/58f63747f185995d7f9cbfca9d240e2d60053184/src/slave/slave.cpp#L5280> indefinitely until the operator intervenes.

To mitigate this and maintain the current behavior, we can have the agent automatically remove the `<work_dir>/meta/slaves/latest` symlink (i.e., do the equivalent of `rm -f <work_dir>/meta/slaves/latest`) upon recovery failure, but only after the host has rebooted. This way the agent can restart as a new agent without operator intervention.

Any thoughts?

BTW this speaks to the need for MESOS-1739.

Yan

On Tue, Nov 15, 2016 at 7:37 AM, Megha Sharma <[email protected]> wrote:

> Hi All,
>
> We have been working on the design for Restartable tasks (MESOS-3545)
> and allowing agents to recover and re-register post reboot is a
> pre-requisite for that.
>
> Agent today doesn’t recover its state that includes its SlaveID post a
> host reboot, it short-circuits the recovery upon discovering the reboot
> and registers with the master as a new agent. With Partition Awareness,
> the mesos master even allows agents which have failed master’s health
> check pings (unreachable agents) to re-register with it and reconcile
> the tasks/executors. The executors on a rebooted host are anyway
> terminated so there is no harm in letting such an agent recover and
> re-register with the master using its old SlaveID.
>
> Would like to hear from the folks here if you see any operational
> concerns with letting the agents recover post a host reboot.
>
> MESOS JIRA: https://issues.apache.org/jira/browse/MESOS-6223
>
> Many Thanks
> Megha Sharma
