[ https://issues.apache.org/jira/browse/MESOS-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ilya Pronin updated MESOS-7795:
-------------------------------
Description:
Currently, when the agent detects that the host was rebooted, it doesn't recover
agent info. New agent info is not checkpointed until the agent successfully
registers with a master. If the agent crashes before registering, on restart it
will recover the old agent info that was checkpointed before the host reboot.
This can lead to problems. E.g. the agent may flap due to incompatible agent
info if its resources change after the reboot. Also, reusing the old agent ID
during reregistration may cause crashes like MESOS-7432.
We can remove the "latest" symlink when we detect that the current boot ID
differs from the checkpointed one, so that the agent cannot recover stale info
after we checkpoint the new boot ID. Alternatively, we can postpone
checkpointing the boot ID until we have checkpointed the new agent info.
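A minimal C++17 sketch of the first option, under stated assumptions: the meta directory is assumed to hold the checkpointed boot ID in a `boot_id` file and the "latest" symlink under `slaves/`, and `handleReboot`, `readLine`, and `writeLine` are hypothetical helpers, not Mesos functions. What it illustrates is the ordering: the symlink is removed before the new boot ID is written, so a crash between the two steps still leaves a mismatched boot ID on disk and the cleanup simply repeats on the next start.

```cpp
#include <filesystem>
#include <fstream>
#include <optional>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: read the first line of a file, or nullopt if the
// file is missing or empty.
std::optional<std::string> readLine(const fs::path& file) {
  std::ifstream in(file);
  std::string line;
  if (!in || !std::getline(in, line)) {
    return std::nullopt;
  }
  return line;
}

// Hypothetical helper: overwrite a file with a single line.
// Not atomic; a production version would likely write a temp file and rename.
void writeLine(const fs::path& file, const std::string& value) {
  std::ofstream out(file, std::ios::trunc);
  out << value << '\n';
}

// Sketch of the proposed fix. Assumed layout (hypothetical):
//   <metaDir>/boot_id        checkpointed boot ID
//   <metaDir>/slaves/latest  symlink to the last agent's checkpoint dir
// Returns true if a reboot was detected.
bool handleReboot(const fs::path& metaDir, const std::string& currentBootId) {
  const fs::path bootIdFile = metaDir / "boot_id";
  const fs::path latest = metaDir / "slaves" / "latest";

  const std::optional<std::string> checkpointed = readLine(bootIdFile);
  const bool rebooted =
      checkpointed.has_value() && *checkpointed != currentBootId;

  if (rebooted) {
    // Drop the symlink BEFORE checkpointing the new boot ID so stale agent
    // info can never be recovered alongside the new boot ID.
    std::error_code ec;
    fs::remove(latest, ec);  // removes the link itself, not its target
  }

  writeLine(bootIdFile, currentBootId);
  return rebooted;
}
```

On Linux, the current boot ID can be read from `/proc/sys/kernel/random/boot_id`. The alternative proposal would leave the symlink alone and instead defer the `writeLine(bootIdFile, ...)` call until the new agent info has been checkpointed, so a crash before registration still leaves a mismatched boot ID on disk.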
was:
Currently, when the agent detects that the host was rebooted, it doesn't recover
agent info. New agent info is not checkpointed until the agent successfully
registers with a master. If the agent crashes before registering, on restart it
will recover the old agent info that was checkpointed before the host reboot.
This can lead to problems. E.g. the agent may flap due to incompatible agent
info if its resources change after the reboot. Also, reusing the old agent ID
during reregistration may cause crashes like MESOS-7432.
We can remove the "latest" symlink when we detect that the current boot ID
differs from the checkpointed one, so that the agent cannot recover stale info
after we checkpoint the new boot ID.
> Remove "latest" symlink after agent reboot
> ------------------------------------------
>
> Key: MESOS-7795
> URL: https://issues.apache.org/jira/browse/MESOS-7795
> Project: Mesos
> Issue Type: Improvement
> Components: agent
> Reporter: Ilya Pronin
> Priority: Minor
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)