[ 
https://issues.apache.org/jira/browse/MESOS-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039413#comment-14039413
 ] 

Dominic Hamon commented on MESOS-1523:
--------------------------------------

https://reviews.apache.org/r/22843

> ZooKeeper timeout should be longer
> ----------------------------------
>
>                 Key: MESOS-1523
>                 URL: https://issues.apache.org/jira/browse/MESOS-1523
>             Project: Mesos
>          Issue Type: Improvement
>          Components: slave
>            Reporter: Dominic Hamon
>            Assignee: Dominic Hamon
>
> {{zookeeper_init}} relies on name resolution, which can temporarily fail. When
> {{getaddrinfo}} returns {{EAI_AGAIN}}, which normally suggests a retry,
> ZooKeeper instead returns {{EINVAL}} to the calling code. We currently use
> this as a signal that we should retry.
> However, our timeout is set to 10 seconds. If there are, say, three
> nameservers and each takes fifteen seconds to time out, a single call to
> {{zookeeper_init}} takes 45 seconds, so we will try only once before
> aborting.
> To increase resilience in the case of name server failure, we should increase 
> this timeout.
> Given that the slave is still able to respond to health checks and tasks are
> still running, the timeout can be quite long. However, we don't want to stay
> in this state too long, as we want to readily observe a more persistent name
> resolution error.
> As such, ten minutes seems reasonable.
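
To make the arithmetic in the description concrete, here is a rough sketch of how many init attempts fit inside the retry window. It is not the Mesos code: the function name, its parameters, and the assumption that the deadline is only checked between calls (so an in-flight call always runs to completion) are all hypothetical, chosen to match the behaviour described above.

```python
def count_init_attempts(call_duration_s, timeout_s, retry_delay_s=0.0):
    """Count how many init calls start before the retry deadline passes.

    Hypothetical model: the deadline is checked only between calls, so a
    call that has already started is never interrupted.
    """
    if call_duration_s + retry_delay_s <= 0:
        raise ValueError("each attempt must take a positive amount of time")

    elapsed = 0.0
    attempts = 0
    while elapsed < timeout_s:
        attempts += 1
        # Every failed attempt costs the full call duration plus any pause.
        elapsed += call_duration_s + retry_delay_s
    return attempts


# Figures from the description: three nameservers, fifteen seconds each.
call_duration_s = 3 * 15

print(count_init_attempts(call_duration_s, timeout_s=10))   # 1
print(count_init_attempts(call_duration_s, timeout_s=600))  # 14
```

Under these assumptions, the 10 second timeout allows a single attempt, while a ten minute timeout allows roughly fourteen before giving up. Any pause between retries would lower that number somewhat.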



--
This message was sent by Atlassian JIRA
(v6.2#6252)
