zhaorongsheng opened a new issue, #63358:
URL: https://github.com/apache/doris/issues/63358

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Version
   
   Doris BE 2.1.x.
   
   ### What's Wrong?
   
   After a group of BE nodes was permanently removed from the cluster (DROP
BACKEND issued on the FE, machines shut down, DNS A/PTR records deleted), every
surviving BE in the same cluster keeps logging two kinds of WARNING
indefinitely:
   
     Symptom A — DNSCache refresh thread floods be.WARNING
   
     W<date> <ts> <tid> network_util.cpp:115] failed to get ip from host: be-old-1.example.com  err: Name or service not known
     W<date> <ts> <tid> status.h:415] meet error status: [INTERNAL_ERROR]failed to get ip from host: be-old-1.example.com, err: Name or service not known
             0#  doris::hostname_to_ipv4(...) at be/src/util/network_util.cpp:125
             1#  doris::hostname_to_ip(...)   at be/src/util/network_util.cpp:104
             2#  doris::DNSCache::_update(...) at be/src/common/status.h:494
             3#  doris::DNSCache::_refresh_cache() at be/src/common/status.h:380
   
     Once per minute per stale hostname, indefinitely.
   
     Symptom B — brpc keeps reconnecting to the cached (now unreachable) IP
   
     W<date> <ts> <tid> socket.cpp:1270] Fail to wait EPOLLOUT of fd=<n>: Connection timed out [110]
   
     In our case this fires ~4 times per second (~340K times per day),
accumulating more than 3.7M occurrences over 11 days. The IPs the BE keeps
trying to reach are the last successfully resolved IPs of the dropped hostnames,
which DNSCache::_resolve_hostname() serves back after every refresh failure. A
single BE's be.WARNING grew to 634 MB in 11 days, and every BE in the cluster is
affected in the same way.
   
     Root cause
   
     be/src/util/dns_cache.cpp (master HEAD, lines 57–121):
   
     - _refresh_cache() iterates every cached hostname every 60 s and calls 
_update.
     - _update → _resolve_hostname. On resolution failure, _resolve_hostname 
returns the stale cached IP so callers can keep using it. That is a reasonable 
graceful-degradation choice.
     - However, the entry is never removed from the cache map. There is no 
failure counter, no TTL, no eviction policy.
     - Consequence: as long as the BE process lives, the hostname is
re-resolved (and re-fails) once per minute, forever. BrpcClientCache /
ClientCache keep handing the stale IP to brpc, which keeps timing out at the
kernel level (ETIMEDOUT after tcp_syn_retries, ~127 s).
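   
     One possible direction, sketched under assumptions: keep the current
stale-IP fallback, but count consecutive refresh failures per hostname and evict
the entry once a threshold is reached. This is not the Doris implementation.
The class EvictingDnsCache, the Resolver callback, and the max_failures
parameter are all hypothetical names chosen for illustration, locking is
omitted, and the code assumes C++17.

```cpp
// Hypothetical sketch, NOT Doris code: a DNS cache that keeps serving the
// stale IP on refresh failure but evicts the entry after `max_failures`
// consecutive failed refreshes. Locking is omitted for brevity.
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumed resolver shape: returns the IP on success, std::nullopt on failure.
using Resolver = std::function<std::optional<std::string>(const std::string&)>;

class EvictingDnsCache {
public:
    EvictingDnsCache(Resolver resolver, int max_failures)
            : _resolver(std::move(resolver)), _max_failures(max_failures) {}

    // Returns the cached IP, resolving on first use.
    // Returns std::nullopt if the host is not cached and cannot be resolved.
    std::optional<std::string> get(const std::string& host) {
        auto it = _cache.find(host);
        if (it != _cache.end()) return it->second.ip;
        auto ip = _resolver(host);
        if (ip) _cache[host] = Entry{*ip, 0};
        return ip;
    }

    // One refresh pass over all cached hostnames.
    // Returns the hostnames evicted during this pass.
    std::vector<std::string> refresh() {
        std::vector<std::string> evicted;
        for (auto it = _cache.begin(); it != _cache.end();) {
            auto ip = _resolver(it->first);
            if (ip) {
                // Success: store the fresh IP and reset the failure counter.
                it->second = Entry{*ip, 0};
                ++it;
            } else if (++it->second.failures >= _max_failures) {
                // Too many consecutive failures: drop the entry so it is no
                // longer re-resolved or handed out.
                evicted.push_back(it->first);
                it = _cache.erase(it);
            } else {
                // Below the threshold: keep serving the stale IP.
                ++it;
            }
        }
        return evicted;
    }

    std::size_t size() const { return _cache.size(); }

private:
    struct Entry {
        std::string ip;
        int failures = 0;
    };

    Resolver _resolver;
    int _max_failures;
    std::unordered_map<std::string, Entry> _cache;
};
```

     With max_failures = 3 and a 60 s refresh interval, a dropped hostname
would be served from the stale entry for at most about three minutes and then
disappear from the cache. Evicting the cached brpc channels for that host would
still need separate handling, which this sketch does not cover.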
   
   ### What You Expected?
   
     Once a dropped BE's hostname can no longer be resolved, the surviving BEs
should eventually evict it from DNSCache (for example after a bounded number of
failed refreshes, or after a TTL) and stop handing its stale IP to brpc, so that
both kinds of WARNING stop instead of repeating for the lifetime of the BE
process.
   
   ### How to Reproduce?
   
     1. Bring up a Doris cluster (≥ 2 BEs).
     2. Pick a hostname victim.example.com that points to a working BE. Issue 
queries / data ingestion that go through DNSCache::get (e.g. broker load, 
internal RPC) so the hostname enters the cache.
     3. Decommission and remove the BE: DROP BACKEND "victim.example.com:9050";
     4. Delete victim.example.com from DNS (or /etc/hosts).
     5. Observe be.WARNING on the other BEs. Within 1 minute the first failed 
to get ip from host line appears. It never goes away.
   
   ### Anything Else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

