Re: MESOS-6233 Allow agents to re-register post a host reboot

Joris Van Remoortere Mon, 12 Dec 2016 11:34:25 -0800

>
> So one thing that was brought up during offline conversations was that if
> the host reboot is associated with hardware change (e.g., a new memory
> stick):



>    - With the change: the agent could run into incompatible agent info
>    due to resource change and flap
>    
> <https://github.com/apache/mesos/blob/58f63747f185995d7f9cbfca9d240e2d60053184/src/slave/slave.cpp#L5280>
>  indefinitely
>    until the operator intervenes.
>
> Can you elaborate on this?

Would you run into this because you don't explicitly specify the memory
resource in the agent configuration? I think we highly recommend that you
do this in production to prevent accidental incompatibility of resources
even without an actual hardware change. Historically there were some issues
reported where the kernel reported a slightly different amount of memory
after reboot.

—
*Joris Van Remoortere*
Mesosphere

On Mon, Nov 28, 2016 at 6:09 PM, Yan Xu <[email protected]> wrote:

> So one thing that was brought up during offline conversations was that if
> the host reboot is associated with hardware change (e.g., a new memory
> stick):
>
>
>    - Currently: the agent would skip the recovery (and the chance of
>    running into incompatible agent info) and register as a new agent.
>    - With the change: the agent could run into incompatible agent info
>    due to resource change and flap
>    
> <https://github.com/apache/mesos/blob/58f63747f185995d7f9cbfca9d240e2d60053184/src/slave/slave.cpp#L5280>
>    indefinitely until the operator intervenes.
>
>
> To mitigate this and maintain the current behavior, we can have the agent
> remove `rm -f <work_dir>/meta/slaves/latest` automatically upon recovery
> failure but only after the host has rebooted. This way the agent can
> restart as a new agent without operator intervention.
>
> Any thoughts?
>
> BTW this speaks to the need for MESOS-1739.
>
> Yan
>
> On Tue, Nov 15, 2016 at 7:37 AM, Megha Sharma <[email protected]> wrote:
>
>> Hi All,
>>
>> We have been working on the design for Restartable tasks (
>> MESOS-3545) and allowing agents to recover and re-register post reboot is a
>> pre-requisite for that.
>> Agent today doesn’t recover its state that includes its SlaveID post a
>> host reboot, it short-circuits the recovery upon discovering the reboot and
>> registers with the master as a new agent. With Partition Awareness, the
>> mesos master even allows agents which have failed master’s health check
>> pings (unreachable agents) to re-register with it and reconcile the
>> tasks/executors. The executors on a rebooted host are anyway terminated so
>> there is no harm in letting such an agent recover and re-register with the
>> master using its old SlaveID.
>> Would like to hear from the folks here if you see any operational
>> concerns with letting the agents recover post a host reboot.
>>
>> MESOS JIRA: https://issues.apache.org/jira/browse/MESOS-6223
>>
>> Many Thanks
>> Megha Sharma
>>
>>
>>
>

Re: MESOS-6233 Allow agents to re-register post a host reboot

Reply via email to