Dominic Hamon created MESOS-1523:
------------------------------------

             Summary: ZooKeeper timeout should be longer
                 Key: MESOS-1523
                 URL: https://issues.apache.org/jira/browse/MESOS-1523
             Project: Mesos
          Issue Type: Improvement
          Components: slave
            Reporter: Dominic Hamon
            Assignee: Dominic Hamon


{{zookeeper_init}} relies on name resolution which can temporarily fail. When 
{{getaddrinfo}} returns {{EAI_AGAIN}}, which normally suggests a retry, 
ZooKeeper instead returns {{EINVAL}} to the calling code. We currently use this 
as a signal that we should retry.
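
As a rough illustration of that retry behaviour (a sketch only, not the actual Mesos code): {{init}} below is a hypothetical stand-in for {{zookeeper_init}} that raises {{OSError(EINVAL)}} on failure, and {{init_with_retry}}, {{clock}}, {{sleep}} and {{backoff_secs}} are names invented for this example. It assumes the deadline is only checked after an attempt has returned.

```python
import errno
import time


def init_with_retry(init, timeout_secs, clock=time.monotonic,
                    sleep=time.sleep, backoff_secs=1.0):
    """Call init() until it succeeds or timeout_secs has elapsed.

    init is a hypothetical stand-in for zookeeper_init: it returns a
    handle on success and raises OSError(errno.EINVAL) on failure.
    Returns (handle, number_of_attempts).
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        try:
            return init(), attempts
        except OSError as e:
            if e.errno != errno.EINVAL:
                raise  # only EINVAL is treated as the retry signal
            # Assumption: the deadline is checked only between attempts,
            # so one slow attempt can overshoot the timeout on its own.
            if clock() - start >= timeout_secs:
                raise TimeoutError(
                    f"gave up after {attempts} attempt(s)") from e
            sleep(backoff_secs)
```

Because the deadline is only consulted between attempts, a single attempt that blocks for longer than {{timeout_secs}} exhausts the whole budget.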

However, our timeout is set to 10 seconds. If there are, say, three nameservers
and each takes fifteen seconds to time out, a single call to
{{zookeeper_init}} takes 45 seconds, so we will only try once before
aborting.
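
The arithmetic can be sketched as follows. This is an illustrative model, not a measurement: it assumes every attempt hits the worst case (each nameserver times out in turn), that the deadline is only checked between attempts, and that there is no backoff. The function name {{attempts_before_abort}} is made up for this example.

```python
import math


def attempts_before_abort(timeout_secs, nameservers, per_server_timeout_secs):
    """Number of init attempts that start before the timeout expires."""
    # Worst case: one attempt waits for every nameserver to time out.
    attempt_secs = nameservers * per_server_timeout_secs
    # The first attempt always runs; later ones start only while the
    # elapsed time is still below the timeout.
    return max(1, math.ceil(timeout_secs / attempt_secs))


print(attempts_before_abort(10, 3, 15))   # 1  (current 10 second timeout)
print(attempts_before_abort(600, 3, 15))  # 14 (a ten minute timeout)
```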

To increase resilience in the case of nameserver failure, we should increase
this timeout.

Given that the slave is still able to respond to health checks and tasks are
still running, this can be quite long. However, we don't want to stay in this
state too long, as we want to promptly observe a more persistent name resolution
error.

As such, ten minutes seems reasonable.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
