Daniel Nugent created MESOS-3036:
------------------------------------
Summary: Slave timeout options have drastically different
behaviors. Interaction unclear/non-obvious from documentation.
Key: MESOS-3036
URL: https://issues.apache.org/jira/browse/MESOS-3036
Project: Mesos
Issue Type: Documentation
Affects Versions: 0.22.1
Reporter: Daniel Nugent
The documentation for the Slave's recovery_timeout option would seem to
indicate that a recovery of up to 15 minutes is possible. However, because of
the Master's slave_ping_timeout and max_slave_ping_timeouts options, any
recovery that takes longer than 75 seconds (plus, apparently, some time between
the slave dying and the Master registering it as disconnected [I can't find a
flag that determines this]) will result in all tasks running under the slave
being stopped, and then restarted if the Slave can recover, because the Master
has already moved the tasks into the TASK_LOST state.
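
To make the arithmetic behind the 75-second figure concrete, here is a rough
sketch. It assumes the Master defaults are slave_ping_timeout=15secs and
max_slave_ping_timeouts=5 (which multiply out to the 75 seconds above) and the
Slave default is recovery_timeout=15mins. The function names are made up for
illustration and are not part of Mesos, and the extra detection delay mentioned
above is not modeled, so treat the result as a lower bound.

```python
from datetime import timedelta

# Assumed defaults; only the 75 s and 15 min totals appear in the report.
SLAVE_PING_TIMEOUT = timedelta(seconds=15)  # Master: slave_ping_timeout
MAX_SLAVE_PING_TIMEOUTS = 5                 # Master: max_slave_ping_timeouts
RECOVERY_TIMEOUT = timedelta(minutes=15)    # Slave: recovery_timeout


def effective_recovery_window(
    ping_timeout=SLAVE_PING_TIMEOUT,
    max_timeouts=MAX_SLAVE_PING_TIMEOUTS,
    recovery_timeout=RECOVERY_TIMEOUT,
):
    """Hypothetical helper: the shorter of the two limits wins.

    The Master gives up on the slave after ping_timeout * max_timeouts,
    regardless of how long the slave itself is willing to wait.
    """
    master_window = ping_timeout * max_timeouts
    return min(master_window, recovery_timeout)


def tasks_survive(downtime, **flags):
    """Hypothetical helper: True if a slave back within `downtime` keeps its tasks."""
    return downtime < effective_recovery_window(**flags)


if __name__ == "__main__":
    print(effective_recovery_window())            # 0:01:15
    print(tasks_survive(timedelta(seconds=60)))   # True
    print(tasks_survive(timedelta(minutes=5)))    # False
```

Under these assumptions, a 5-minute outage is well inside recovery_timeout but
still loses its tasks, which is the surprise described above.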
The documentation should clearly state that tasks will be stopped upon Slave
recovery, even within the recovery_timeout period, if the Master's
slave_ping_timeout and max_slave_ping_timeouts options have already caused the
Master to shut down the slave.
Also, maybe explain what the project intends the recovery_timeout setting to
actually be used for? I'm a little unclear on that point now myself. Presumably
some fudge factor to allow tasks to have time to restart on another slave if
the slave is out of commission?