zhaorongsheng opened a new pull request, #63363:
URL: https://github.com/apache/doris/pull/63363
Proposed changes
Issue Number: close #63358
DNSCache currently never evicts an entry once it has been inserted. When a
backend host is permanently dropped from the cluster and its DNS record is
removed, every other BE in the cluster keeps:
1. logging "failed to get ip from host: <removed>" every 60 s, indefinitely
(from DNSCache::_refresh_cache -> hostname_to_ipv4);
2. handing the stale cached IP back to brpc through BrpcClientCache /
ClientCache, which then keeps emitting "Fail to wait EPOLLOUT ... Connection
timed out" at the brpc socket layer.
This PR adds a simple consecutive-failure counter to DNSCache. When the
counter reaches a configurable threshold, the entry is removed from the cache
so that callers no longer get a stale IP and the refresh thread stops logging
about it. WARNING logs for the same host are also throttled to avoid flooding
be.WARNING.
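The counter, eviction, and throttling described above can be sketched roughly as follows. This is an illustration, not the code in this PR: DnsCacheSketch, DnsCacheConfig, the injected resolver, and warnings_logged() are hypothetical names, and the choice to log the first failure and then every N-th one is an assumption about how the throttle behaves.

```cpp
#include <functional>
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the two new configs (names shortened from the PR).
struct DnsCacheConfig {
    int max_consecutive_failures = 30; // <= 0 disables eviction
    int log_every_n_failures = 60;     // <= 1 logs every failure
};

// Hypothetical, single-threaded sketch; the real DNSCache needs locking.
class DnsCacheSketch {
public:
    // Returns the resolved IP, or std::nullopt when resolution fails.
    using Resolver = std::function<std::optional<std::string>(const std::string&)>;

    DnsCacheSketch(DnsCacheConfig cfg, Resolver resolver)
            : _cfg(cfg), _resolve(std::move(resolver)) {}

    void put(const std::string& host, const std::string& ip) {
        _entries[host] = Entry{ip, 0};
    }

    std::optional<std::string> get(const std::string& host) const {
        auto it = _entries.find(host);
        if (it == _entries.end()) return std::nullopt;
        return it->second.ip;
    }

    // One pass of the periodic refresh over every cached hostname.
    void refresh() {
        for (auto it = _entries.begin(); it != _entries.end();) {
            std::optional<std::string> ip = _resolve(it->first);
            if (ip) {
                it->second.ip = *ip;
                it->second.failures = 0; // success clears the counter
                ++it;
                continue;
            }
            const int n = ++it->second.failures;
            if (should_log(n)) {
                std::cerr << "Failed to resolve " << it->first << " (" << n
                          << " consecutive failures), use cached ip\n";
                ++_warnings_logged;
            }
            if (_cfg.max_consecutive_failures > 0 && n >= _cfg.max_consecutive_failures) {
                it = _entries.erase(it); // evict: callers stop seeing the stale IP
            } else {
                ++it;
            }
        }
    }

    int warnings_logged() const { return _warnings_logged; }

private:
    struct Entry {
        std::string ip;
        int failures = 0;
    };

    // Assumed policy: log failure 1, N+1, 2N+1, ...; log everything when N <= 1.
    bool should_log(int n) const {
        const int every = _cfg.log_every_n_failures;
        return every <= 1 || n % every == 1;
    }

    DnsCacheConfig _cfg;
    Resolver _resolve;
    std::unordered_map<std::string, Entry> _entries;
    int _warnings_logged = 0;
};
```

Injecting the resolver keeps the sketch testable without real DNS; in the BE the equivalent call would be hostname_to_ipv4.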
Configs introduced
Name: dns_cache_max_consecutive_failures
Type: mInt32
Default: 30
Behavior: Evict a hostname after this many consecutive resolution
failures. At the default 60 s refresh interval, that means ~30 minutes of
grace. Set <= 0 to disable eviction (legacy behavior).
────────────────────────────────────────
Name: dns_cache_log_every_n_failures
Type: mInt32
Default: 60
Behavior: Throttle the "Failed to resolve ... use cached ip" warning to once
per N failures per hostname. Set <= 1 to log every failure (legacy behavior).
Both are mutable so operators can tune without restarting BE.
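The grace period quoted for dns_cache_max_consecutive_failures is simply the threshold multiplied by the refresh interval. A small sketch of that arithmetic, where eviction_grace is a hypothetical helper and not part of this PR:

```cpp
#include <chrono>
#include <optional>

// Hypothetical helper: how long a continuously failing hostname stays cached.
// Returns std::nullopt when eviction is disabled (threshold <= 0).
constexpr std::optional<std::chrono::seconds> eviction_grace(
        int max_consecutive_failures, std::chrono::seconds refresh_interval) {
    if (max_consecutive_failures <= 0) {
        return std::nullopt;
    }
    return refresh_interval * max_consecutive_failures;
}
```

With the default threshold of 30 and the 60 s refresh interval this gives 1800 s, i.e. the ~30 minutes mentioned above; a threshold of 5 would shorten it to 5 minutes.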
Backward compatibility
Setting dns_cache_max_consecutive_failures = 0 and
dns_cache_log_every_n_failures = 1 reproduces exactly the pre-PR behavior.
Successful resolution clears the failure counter, so transient DNS hiccups
don't accumulate across hours. No public API or wire format changes.
Further comments
- I deliberately did not touch the FE-side DNSCache.java. If the same fix
is wanted on FE, happy to send a follow-up PR.
- The eviction threshold default (30) is conservative; please push back if
you'd prefer a smaller / larger default. Operators with very flaky DNS can
lower it via the mutable config without redeploying.
---
Checklist
- I have read the Contributing document.
- I have created an issue (#63358) on (or commented on) the related issue.
- I have added unit tests for my change.
- All new and existing tests passed (verified locally → fails to build on
macOS due to unrelated contrib/openblas issue; rely on CI).
- My change requires a change to the documentation. — No (config doc
auto-generated)