zhaorongsheng opened a new pull request, #63363:
URL: https://github.com/apache/doris/pull/63363

   Proposed changes
   
     Issue Number: close #63358
   
     DNSCache currently never evicts an entry once it has been inserted. When a 
backend host is permanently dropped from the cluster and its DNS record is 
removed, every other BE in the cluster keeps:
   
     1. logging "failed to get ip from host: <removed>" every 60 s, indefinitely 
(from DNSCache::_refresh_cache -> hostname_to_ipv4);
     2. handing the stale cached IP back to brpc through BrpcClientCache / 
ClientCache, which then keeps emitting "Fail to wait EPOLLOUT ... Connection 
timed out" at the brpc socket layer.
   
     This PR adds a simple consecutive-failure counter to DNSCache. When the 
counter reaches a configurable threshold, the entry is removed from the cache 
so that callers no longer get a stale IP and the refresh thread stops logging 
about it. WARNING logs for the same host are also throttled to avoid flooding 
be.WARNING.
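
For reviewers who want the idea in isolation, here is a minimal sketch of the consecutive-failure eviction described above. It is not the code in this PR: `DnsCacheSketch`, its `Resolver` callback, and the method names are hypothetical stand-ins, and the real DNSCache also handles locking and the refresh thread, which are omitted here.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the BE DNS cache; NOT the real doris::DNSCache.
class DnsCacheSketch {
public:
    // Assumed resolver shape: returns the IP on success, std::nullopt on failure.
    using Resolver = std::function<std::optional<std::string>(const std::string&)>;

    DnsCacheSketch(Resolver resolver, int max_consecutive_failures)
            : _resolver(std::move(resolver)), _max_failures(max_consecutive_failures) {}

    void put(const std::string& host, const std::string& ip) { _entries[host] = Entry{ip, 0}; }

    // Returns the cached IP, or std::nullopt if the host is unknown or was evicted.
    std::optional<std::string> get(const std::string& host) const {
        auto it = _entries.find(host);
        if (it == _entries.end()) return std::nullopt;
        return it->second.ip;
    }

    // One refresh pass, analogous to what a periodic refresh thread would run.
    void refresh() {
        std::vector<std::string> to_evict;
        for (auto& [host, entry] : _entries) {
            if (auto ip = _resolver(host)) {
                entry.ip = *ip;
                entry.failures = 0; // a success clears the failure streak
            } else {
                ++entry.failures;
                // A threshold <= 0 disables eviction (legacy behavior).
                if (_max_failures > 0 && entry.failures >= _max_failures) {
                    to_evict.push_back(host);
                }
            }
        }
        // Erase after the loop so the map is not modified while iterating.
        for (const auto& host : to_evict) _entries.erase(host);
    }

private:
    struct Entry {
        std::string ip;
        int failures = 0;
    };

    Resolver _resolver;
    int _max_failures;
    std::unordered_map<std::string, Entry> _entries;
};

// Usage: a host that never resolves again is evicted on the third failed pass.
inline void demo_evicts_after_three_failures() {
    DnsCacheSketch cache(
            [](const std::string&) -> std::optional<std::string> { return std::nullopt; }, 3);
    cache.put("gone", "10.0.0.1");
    cache.refresh();
    cache.refresh();
    assert(cache.get("gone").has_value());  // two failures: stale IP still served
    cache.refresh();
    assert(!cache.get("gone").has_value()); // third failure: entry evicted
}
```

Injecting the resolver as a callback is only a convenience for the sketch; it makes the eviction path testable without touching real DNS.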
   
     Configs introduced
   
     Name: dns_cache_max_consecutive_failures
     Type: mInt32
     Default: 30
     Behavior: Evict a hostname after this many consecutive resolution 
failures. At the default 60 s refresh interval, that means ~30 minutes of 
grace. Set <= 0 to disable eviction (legacy behavior).
     ────────────────────────────────────────
     Name: dns_cache_log_every_n_failures
     Type: mInt32
     Default: 60
     Behavior: Throttle the "Failed to resolve ... use cached ip" warning to once 
per N failures per hostname. Set <= 1 to log every failure (legacy behavior).
   
     Both are mutable so operators can tune without restarting BE.
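
To make the two knobs concrete, the sketch below works through the arithmetic behind the grace period and one plausible throttle rule. Both helper names are hypothetical, and the throttle rule (log the first failure, then every N-th one after it) is an assumption about the exact counting; the PR's implementation may count differently.

```cpp
#include <cstdint>

// Hypothetical helper. Assumption: the 1st failure is logged, then every
// N-th failure after it (N+1, 2N+1, ...). A value <= 1 logs every failure.
inline bool should_log_failure(std::int64_t consecutive_failures, std::int32_t log_every_n) {
    if (log_every_n <= 1) return true; // legacy: no throttling
    return consecutive_failures % log_every_n == 1;
}

// Approximate grace before eviction, in seconds; -1 means eviction is disabled.
inline std::int64_t eviction_grace_seconds(std::int32_t max_consecutive_failures,
                                           std::int64_t refresh_interval_s) {
    if (max_consecutive_failures <= 0) return -1;
    return static_cast<std::int64_t>(max_consecutive_failures) * refresh_interval_s;
}
```

With the defaults, `eviction_grace_seconds(30, 60)` gives 1800 s, which matches the roughly 30 minutes quoted above.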
   
     Backward compatibility
   
     Setting dns_cache_max_consecutive_failures = 0 and 
dns_cache_log_every_n_failures = 1 reproduces exactly the pre-PR behavior. 
Successful resolution clears the failure counter, so transient DNS hiccups 
don't accumulate across hours. No public API or wire format changes.
   
     Further comments
   
     - I deliberately did not touch the FE-side DNSCache.java. If the same fix 
is wanted on FE, happy to send a follow-up PR.
     - The eviction threshold default (30) is conservative; please push back if 
you'd prefer a smaller / larger default. Operators with very flaky DNS can 
adjust it via the mutable config without redeploying.
   
     ---
     Checklist
   
     - I have read the Contributing document.
     - I have created an issue (#63358) on (or commented on) the related issue.
     - I have added unit tests for my change.
     - All new and existing tests passed (not verified locally: the build fails 
on macOS due to an unrelated contrib/openblas issue; relying on CI).
     - My change requires a change to the documentation: No (config doc is 
auto-generated).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
