[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hoyong.eom updated ZOOKEEPER-5014:
----------------------------------
    Description: 
h3. Background

When a ZooKeeper client needs to reconnect, it resolves hostnames via DNS.
If the DNS server is temporarily unavailable, the client cannot resolve 
hostnames and fails to reconnect - even if the ZooKeeper servers are healthy 
and IP addresses haven't changed.

*Note:* This is different from 
[ZOOKEEPER-4921|https://issues.apache.org/jira/browse/ZOOKEEPER-4921] 
(network failure reconnection bug in 3.9.3, fixed in 3.9.4).
This proposal addresses DNS server outages specifically.

h3. Problem

* Client has successfully connected before (DNS was working)
* DNS server becomes temporarily unavailable (maintenance, restart, network issue)
* Client tries to reconnect but fails DNS resolution
* Connection fails even though ZooKeeper server IP hasn't changed

h3. Proposal

Cache successfully resolved IP addresses and use them as fallback when DNS 
resolution fails.

* New option: {{zookeeper.client.dnsFallback.enabled}} (default: {{false}})
* On successful DNS resolution: cache the IP address
* On DNS failure with the option enabled: use the cached IP as fallback
* Backward compatible (disabled by default)

h3. Use Cases

* On-premise environments with unstable DNS infrastructure
* Environments where server IP addresses rarely change
* Temporary DNS outages (maintenance windows, DNS server restarts)

h3. Implementation

*ZKClientConfig.java:*
{code:java}
// Client property that turns on the cached-IP fallback (off by default).
public static final String ZOOKEEPER_DNS_FALLBACK_ENABLED = "zookeeper.client.dnsFallback.enabled";
public static final boolean ZOOKEEPER_DNS_FALLBACK_ENABLED_DEFAULT = false;

public boolean isDnsFallbackEnabled() {
    return getBoolean(ZOOKEEPER_DNS_FALLBACK_ENABLED, ZOOKEEPER_DNS_FALLBACK_ENABLED_DEFAULT);
}
{code}

*StaticHostProvider.java:*
{code:java}
// Last successfully resolved address per hostname, consulted only when a DNS lookup fails.
private final Map<String, InetAddress> resolvedAddressCache = new ConcurrentHashMap<>();

private InetSocketAddress resolve(InetSocketAddress address) {
    String hostString = address.getHostString();
    try {
        InetAddress resolved = resolver.getAllByName(hostString)[0];
        // Cache on success
        resolvedAddressCache.put(hostString, resolved);
        return new InetSocketAddress(resolved, address.getPort());
    } catch (UnknownHostException e) {
        // Fallback to cached IP if enabled
        if (clientConfig.isDnsFallbackEnabled()) {
            InetAddress cached = resolvedAddressCache.get(hostString);
            if (cached != null) {
                LOG.warn("DNS failed for {}, using cached IP {}", hostString, cached);
                return new InetSocketAddress(cached, address.getPort());
            }
        }
        // No usable cache entry: keep the existing behavior and return the unresolved address.
        LOG.error("Unable to resolve address: {}", address.toString(), e);
        return address;
    }
}
{code}
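
The fallback behavior can be exercised without a real DNS outage. The following is a minimal, self-contained sketch and not the actual {{StaticHostProvider}} change: the class {{DnsFallbackSketch}}, its nested {{Resolver}} interface, the host name {{zk1.example.com}} and the address 10.0.0.1 are all assumptions made for illustration. A fake resolver is switched from "up" to "down" to simulate the outage.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DnsFallbackSketch {
    /** Hypothetical stand-in for the resolver that the host provider calls. */
    public interface Resolver {
        InetAddress[] getAllByName(String name) throws UnknownHostException;
    }

    private final Map<String, InetAddress> cache = new ConcurrentHashMap<>();
    private final Resolver resolver;
    private final boolean fallbackEnabled;

    public DnsFallbackSketch(Resolver resolver, boolean fallbackEnabled) {
        this.resolver = resolver;
        this.fallbackEnabled = fallbackEnabled;
    }

    public InetSocketAddress resolve(InetSocketAddress address) {
        String host = address.getHostString();
        try {
            InetAddress[] all = resolver.getAllByName(host);
            if (all.length == 0) {
                return address; // nothing resolved, hand back the input unchanged
            }
            cache.put(host, all[0]); // remember the last good answer
            return new InetSocketAddress(all[0], address.getPort());
        } catch (UnknownHostException e) {
            // Lookup failed: use the cached address only if the option is on.
            InetAddress cached = fallbackEnabled ? cache.get(host) : null;
            return cached != null
                    ? new InetSocketAddress(cached, address.getPort())
                    : address;
        }
    }

    public static void main(String[] args) throws Exception {
        boolean[] dnsUp = {true};
        // getByAddress builds the object from raw bytes and performs no lookup.
        InetAddress ip = InetAddress.getByAddress("zk1.example.com", new byte[] {10, 0, 0, 1});
        Resolver fake = name -> {
            if (!dnsUp[0]) {
                throw new UnknownHostException(name);
            }
            return new InetAddress[] {ip};
        };

        DnsFallbackSketch sketch = new DnsFallbackSketch(fake, true);
        InetSocketAddress zk = InetSocketAddress.createUnresolved("zk1.example.com", 2181);

        System.out.println(sketch.resolve(zk).isUnresolved()); // false: resolved and cached
        dnsUp[0] = false;                                      // simulate the DNS outage
        System.out.println(sketch.resolve(zk).isUnresolved()); // false: served from the cache
    }
}
```

With {{fallbackEnabled}} set to {{false}}, the second call returns the unresolved input address, which matches the behavior before this change.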

h3. Usage

{code}
zookeeper.client.dnsFallback.enabled=true
{code}
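
If the new key is read the same way as other {{zookeeper.*}} client settings, it could presumably also be supplied as a JVM system property ({{-Dzookeeper.client.dnsFallback.enabled=true}}). The sketch below only illustrates that lookup with plain JDK calls. The class {{DnsFallbackFlagSketch}} is hypothetical, and how {{ZKClientConfig}} would actually pick up the key is an assumption, not part of this proposal's code.

```java
public class DnsFallbackFlagSketch {
    public static final String KEY = "zookeeper.client.dnsFallback.enabled";

    /** Reads the flag from system properties; absent or malformed values mean "disabled". */
    public static boolean isDnsFallbackEnabled() {
        return Boolean.parseBoolean(System.getProperty(KEY, "false"));
    }

    public static void main(String[] args) {
        // Same effect as passing -Dzookeeper.client.dnsFallback.enabled=true to the JVM.
        System.setProperty(KEY, "true");
        System.out.println(isDnsFallbackEnabled());
    }
}
```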

  was:
h2. Problem

When the DNS server is temporarily unavailable, the ZooKeeper client cannot reconnect 
to the ZooKeeper server even if:
- The client was previously connected successfully
- The ZooKeeper server is still running and healthy
- Only the DNS server is down

h2. Current Behavior

In {{StaticHostProvider.resolve()}}, when DNS resolution fails:

{code:java}
} catch (UnknownHostException e) {
    LOG.error("Unable to resolve address: {}", address.toString(), e);
    return address;  // Returns unresolved address
}
{code}

The method returns an unresolved {{InetSocketAddress}}, which carries no IP address 
and therefore causes the connection attempt to fail.
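
As a small illustration of what "unresolved" means here, the snippet below builds such an address with the JDK directly. The class name {{UnresolvedAddressSketch}} is made up for this example, and the host name is the one from the test table.

```java
import java.net.InetSocketAddress;

public class UnresolvedAddressSketch {
    /** An address can be dialed only if it carries a resolved InetAddress. */
    public static boolean isUsable(InetSocketAddress address) {
        return !address.isUnresolved();
    }

    public static void main(String[] args) {
        // createUnresolved never performs a DNS lookup.
        InetSocketAddress a =
                InetSocketAddress.createUnresolved("non-existent-host.invalid", 2181);
        System.out.println(a.isUnresolved()); // true
        System.out.println(a.getAddress());   // null: there is no IP to connect to
        System.out.println(isUsable(a));      // false
    }
}
```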

h2. Test Results

|| Test Case || Result || Time ||
| localhost:2181 | Connected | 164ms |
| non-existent-host.invalid:2181 | Failed | 10,005ms (timeout) |

Exception chain:

{code}
UnknownHostException → IllegalArgumentException → ConnectionLossException
{code}

Tested with ZooKeeper client 3.7.2 / 3.9.4, Curator 5.6.0 / 5.7.1, Java 21.

h2. Proposal

Cache the last successfully resolved IP address and use it as fallback when DNS 
resolution fails.

{code:java}
private final Map<String, InetAddress> resolvedAddressCache = new ConcurrentHashMap<>();

private InetSocketAddress resolve(InetSocketAddress address) {
    String hostname = address.getHostString();
    
    try {
        InetAddress resolved = resolver.getAllByName(hostname)[0];
        // Cache on success
        resolvedAddressCache.put(hostname, resolved);
        return new InetSocketAddress(resolved, address.getPort());
    } catch (UnknownHostException e) {
        // Fallback to cached address
        if (clientConfig.isDnsFallbackEnabled()) {
            InetAddress cached = resolvedAddressCache.get(hostname);
            if (cached != null) {
                LOG.warn("DNS failed for {}, using cached address: {}", hostname, cached);
                return new InetSocketAddress(cached, address.getPort());
            }
        }
        LOG.error("Unable to resolve address: {}", address.toString(), e);
        return address;
    }
}
{code}

h2. Design Considerations

* *Disabled by default*: New property {{zookeeper.client.dnsFallback.enabled=false}}
* *Backward compatible*: Existing behavior unchanged unless explicitly enabled
* *Complements existing work*: Does not conflict with ZOOKEEPER-2184 (re-resolve on connection failure)

h2. Use Case

This is useful in cloud/container environments where:
* DNS server may have temporary failures
* ZooKeeper server IP remains stable
* Client should maintain connection resilience

h2. Related Issues

* ZOOKEEPER-2184: Re-resolve hosts when connection fails
* ZOOKEEPER-1506: Re-try DNS resolution if node connection fails
* CURATOR-229: No retry on DNS lookup failure

I'm happy to submit a PR if this approach is acceptable.


> Cache resolved IP addresses as fallback for DNS server failures
> ---------------------------------------------------------------
>
>                 Key: ZOOKEEPER-5014
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-5014
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: java client
>    Affects Versions: 3.9.4
>            Reporter: hoyong.eom
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
