[
https://issues.apache.org/jira/browse/MESOS-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neil Conway updated MESOS-3790:
-------------------------------
Assignee: (was: Neil Conway)
> Zk connection should retry on EAI_NONAME
> ----------------------------------------
>
> Key: MESOS-3790
> URL: https://issues.apache.org/jira/browse/MESOS-3790
> Project: Mesos
> Issue Type: Bug
> Reporter: Neil Conway
> Priority: Minor
> Labels: mesosphere, zookeeper
>
> The ZooKeeper interface is designed to retry (once per second for up to ten
> minutes) if one or more of the ZooKeeper hostnames can't be resolved (see
> [MESOS-1326] and [MESOS-1523]).
> The current implementation assumes that a DNS resolution failure is
> indicated by zookeeper_init() returning NULL with errno set to EINVAL
> (Zk translates getaddrinfo() failures into errno values). That assumption
> does not hold for every resolution failure, because the current Zk code does:
> {code}
> static int getaddrinfo_errno(int rc) {
>     switch(rc) {
>     case EAI_NONAME:
> // ZOOKEEPER-1323 EAI_NODATA and EAI_ADDRFAMILY are deprecated in FreeBSD.
> #if defined EAI_NODATA && EAI_NODATA != EAI_NONAME
>     case EAI_NODATA:
> #endif
>         return ENOENT;
>     case EAI_MEMORY:
>         return ENOMEM;
>     default:
>         return EINVAL;
>     }
> }
> {code}
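> getaddrinfo() returns EAI_NONAME when "the node or service is not known"; per
> discussion in [MESOS-2186], this seems to happen intermittently due to DNS
> failures. Zk reports that case as ENOENT rather than EINVAL, so a retry
> check that only looks for EINVAL does not cover it.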
> Proposed fix: looking at errno is always going to be somewhat fragile, but if
> we're going to continue doing that, we should check for ENOENT as well as
> EINVAL.
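
The proposed check could look roughly like the sketch below. It is only an illustration of the idea, not the actual Mesos change: the helper name is_retryable_resolution_error() is hypothetical, and the sketch assumes the caller reads errno immediately after zookeeper_init() returns NULL.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not part of Mesos or ZooKeeper): decides whether the
 * errno observed after zookeeper_init() returned NULL looks like a hostname
 * resolution failure that is worth retrying.
 */
bool is_retryable_resolution_error(int err) {
    switch (err) {
    case EINVAL: /* Zk's default translation of getaddrinfo() failures. */
    case ENOENT: /* Zk's translation of EAI_NONAME (and EAI_NODATA). */
        return true;
    default:     /* Anything else, e.g. ENOMEM, is not treated as DNS-related. */
        return false;
    }
}
```

A retry loop would call this helper with the saved errno value and fall back to the existing failure path when it returns false. ENOMEM is deliberately left out here, since it does not indicate a name resolution problem.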