Daniel Nugent created MESOS-3036:
------------------------------------

             Summary: Slave timeout options have drastically different behaviors. Interaction unclear/non-obvious from documentation.
                 Key: MESOS-3036
                 URL: https://issues.apache.org/jira/browse/MESOS-3036
             Project: Mesos
          Issue Type: Documentation
    Affects Versions: 0.22.1
            Reporter: Daniel Nugent


The documentation for the Slave's recovery_timeout option seems to indicate 
that a recovery taking up to 15 minutes is possible. However, because of the 
Master's slave_ping_timeout and max_slave_ping_timeouts options, any recovery 
that takes longer than 75 seconds (plus, apparently, some time between the 
slave dying and the Master registering it as disconnected [I can't find a flag 
that determines this]) results in all tasks running under the slave being 
stopped and then, if the Slave can recover, restarted, because the Master has 
already moved the tasks into the TASK_LOST state.
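
For reference, the 75 second figure appears to be the product of the two 
Master options at what I believe are their defaults (slave_ping_timeout of 15 
seconds, max_slave_ping_timeouts of 5). A rough Python sketch of the 
arithmetic follows. The default values are assumptions to be checked against 
the flags documentation for your version, the function names are made up for 
illustration, and the unknown disconnect-detection delay is left out:

```python
from datetime import timedelta

# Assumed defaults; verify against the flags documentation for your version.
SLAVE_PING_TIMEOUT = timedelta(seconds=15)   # Master: slave_ping_timeout
MAX_SLAVE_PING_TIMEOUTS = 5                  # Master: max_slave_ping_timeouts
RECOVERY_TIMEOUT = timedelta(minutes=15)     # Slave: recovery_timeout


def master_removal_window(ping_timeout, max_timeouts):
    """Time after which the Master gives up on an unresponsive slave."""
    return ping_timeout * max_timeouts


def effective_recovery_window(recovery_timeout, ping_timeout, max_timeouts):
    """Whichever limit is shorter is the one that actually applies."""
    return min(recovery_timeout, master_removal_window(ping_timeout, max_timeouts))


window = effective_recovery_window(
    RECOVERY_TIMEOUT, SLAVE_PING_TIMEOUT, MAX_SLAVE_PING_TIMEOUTS
)
print(window.total_seconds())  # 75.0
```

With these values the Master-side limit is far shorter than recovery_timeout, 
so the 15 minute setting never gets a chance to matter unless the ping options 
are raised to match.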

The documentation should clearly state that tasks will be stopped upon Slave 
recovery even within the recovery_timeout period if the ping_timeout options 
have caused the master to shut down the slave.

Also, maybe explain what the project intends the recovery_timeout setting to 
actually be used for? I'm a little unclear on that point now myself. Presumably 
some fudge factor to allow tasks to have time to restart on another slave if 
the slave is out of commission?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)