[ https://issues.apache.org/jira/browse/MESOS-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039413#comment-14039413 ]
Dominic Hamon commented on MESOS-1523:
--------------------------------------
https://reviews.apache.org/r/22843
> ZooKeeper timeout should be longer
> ----------------------------------
>
> Key: MESOS-1523
> URL: https://issues.apache.org/jira/browse/MESOS-1523
> Project: Mesos
> Issue Type: Improvement
> Components: slave
> Reporter: Dominic Hamon
> Assignee: Dominic Hamon
>
> {{zookeeper_init}} relies on name resolution, which can fail temporarily. When
> {{getaddrinfo}} returns {{EAI_AGAIN}}, which normally suggests a retry,
> ZooKeeper instead returns {{EINVAL}} to the calling code. We currently use
> this as a signal that we should retry.
> However, our timeout is set to 10 seconds. If there are, say, three
> nameservers and each takes fifteen seconds to time out, a single call to
> {{zookeeper_init}} takes 45 seconds, so we will try only once before
> aborting.
> To increase resilience in the case of nameserver failure, we should increase
> this timeout.
> Given that the slave is still able to respond to health checks and tasks are
> still running, this can be quite long. However, we don't want to stay in this
> state too long, as we want to readily observe a more persistent name
> resolution error.
> As such, ten minutes seems reasonable.
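
The arithmetic in the description can be sketched as a small simulation. This is an illustrative model only, not the Mesos retry loop: the function name {{attemptsBeforeDeadline}} is hypothetical, and it assumes a new attempt starts only while the elapsed time is still below the timeout and that every failed attempt blocks for the same duration.

```cpp
#include <chrono>

// Hypothetical helper (not part of Mesos): counts how many attempts are
// started before `deadline` has elapsed, assuming every failed attempt
// blocks for `attemptDuration`. Pure arithmetic; nothing actually sleeps.
int attemptsBeforeDeadline(
    std::chrono::seconds deadline,
    std::chrono::seconds attemptDuration)
{
  // A non-positive duration would never advance the clock, so bail out.
  if (attemptDuration <= std::chrono::seconds(0)) {
    return 0;
  }

  int attempts = 0;
  std::chrono::seconds elapsed(0);

  // Start another attempt only while we are still inside the deadline.
  while (elapsed < deadline) {
    ++attempts;
    elapsed += attemptDuration;
  }

  return attempts;
}
```

Under these assumptions, three nameservers at fifteen seconds each make one attempt cost 45 seconds. Calling the helper with a 10-second deadline and a 45-second attempt yields 1, matching the "only try once" behaviour described above, while a 600-second (ten-minute) deadline yields 14 attempts.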