[
https://issues.apache.org/jira/browse/MESOS-2301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297779#comment-14297779
]
Benjamin Mahler edited comment on MESOS-2301 at 1/29/15 10:01 PM:
------------------------------------------------------------------
UnregisterSlaveMessage is only sent when you use SIGUSR1 to shut down the
slave. I don't think that's relevant here, unless you're using SIGUSR1 to give
the slave a heads up? Otherwise, the slave isn't aware of the machine going
down until it's restarted after the reboot.
What I suspect you're seeing as an "issue" is the effect of our
conservative value for {{--slave_reregister_timeout}}. We currently enforce a
minimum of 10 minutes, but we can change this so that it can be set much closer
to the health check timeout (75 seconds by default). When you lower this, you'll
likely want to take a look at {{--recovery_slave_removal_limit}} as a safety
measure against a lot of slaves getting removed.
> Slave does not cleanly unregister
> ---------------------------------
>
> Key: MESOS-2301
> URL: https://issues.apache.org/jira/browse/MESOS-2301
> Project: Mesos
> Issue Type: Bug
> Components: master, slave
> Reporter: Dario Rexin
>
> If a machine running the mesos slave is being rebooted, the mesos slave does
> a clean shutdown. It stops all its executors, unregisters from the master
> and removes the symlink to the latest state.
> However, if the master is not reachable during the reboot, it will still
> remove the symlink to the latest state and will register with a new ID when
> restarted. This leads to the master waiting for the slave to come back for
> the configured amount of time and not marking the tasks as lost or killed.
> This also means that these tasks will not be restarted by the framework (in
> this case Marathon), because it assumes they are still alive.
> This problem could be solved by introducing a new message
> `SlaveUnregisteredMessage` that gets sent by the master when a slave
> has successfully unregistered. The slave only has to wait for this message, and
> if it doesn't receive it, it should not remove the symlink to `latest`.
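
The handshake proposed in the description could be modelled roughly as below. This is only a sketch in Python, not Mesos code: `shutdown_slave`, `send_unregister`, `wait_for_ack`, and the timeout value are hypothetical names chosen for illustration, and `SlaveUnregisteredMessage` is the message proposed above, not an existing one.

```python
import os
import tempfile


def shutdown_slave(latest_symlink, send_unregister, wait_for_ack, timeout_secs=5.0):
    """Hypothetical clean shutdown: remove `latest` only if the master acked.

    send_unregister: callable standing in for sending UnregisterSlaveMessage.
    wait_for_ack: callable(timeout_secs) -> bool, True if the proposed
        SlaveUnregisteredMessage arrived within the timeout.
    """
    send_unregister()
    acked = wait_for_ack(timeout_secs)
    # Keep the symlink when the master was unreachable, so the slave can
    # recover its old state (and ID) after the reboot.
    if acked and os.path.islink(latest_symlink):
        os.remove(latest_symlink)
    return acked


if __name__ == "__main__":
    work_dir = tempfile.mkdtemp()
    run_dir = os.path.join(work_dir, "run-1")
    os.mkdir(run_dir)
    latest = os.path.join(work_dir, "latest")
    os.symlink(run_dir, latest)

    # Master unreachable: no ack arrives, so the symlink is kept.
    shutdown_slave(latest, lambda: None, lambda timeout: False)
    print(os.path.islink(latest))
```

With an unreachable master the symlink survives, so the restarted slave would re-register under its old ID instead of a new one.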