Hi,
After upgrading from BIND 9.20.21 to 9.20.23 on Debian 13, I am seeing a large
accumulation of TCP connections in CLOSE_WAIT state to port 853 when forwarding
queries to DoT upstream servers (tested with Cloudflare and DNS4EU).
After some time under normal load, the following command:
$ ss -tnp | grep 853 | awk '{print $1}' | sort | uniq -c | sort -rn
shows something like:
4465 CLOSE-WAIT
2 ESTAB
Connections in CLOSE_WAIT accumulate continuously across all configured DoT
upstream servers:
$ ss -tnp | grep 853 | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn
1321 86.54.11.200 (DNS4EU)
1203 86.54.11.100 (DNS4EU)
1080 1.1.1.1 (Cloudflare)
861 1.0.0.1 (Cloudflare)
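
In case it helps anyone reproduce the counts, here is a rough Python sketch of
the same tally. It assumes the usual "ss -tn" column layout (State, Recv-Q,
Send-Q, Local, Peer) and, unlike "grep 853", it only counts lines whose peer
port is exactly 853. The function name summarize_ss is made up for this
example.

```python
from collections import Counter


def summarize_ss(ss_output, port=853):
    """Count TCP states and peer hosts for connections to `port`.

    Expects the text printed by `ss -tn` or `ss -tnp`: one connection per
    line with the columns State, Recv-Q, Send-Q, Local, Peer[, Process].
    """
    states = Counter()
    peers = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header line and anything too short to be a connection.
        if len(fields) < 5 or fields[0] == "State":
            continue
        state, peer = fields[0], fields[4]
        # Split on the last colon so bracketed IPv6 peers also work.
        host, _, peer_port = peer.rpartition(":")
        if peer_port != str(port):
            continue
        states[state] += 1
        peers[host] += 1
    return states, peers
```

The input can be captured with something like
subprocess.run(["ss", "-tn"], capture_output=True, text=True).stdout, and
states.most_common() then gives the same ordering as "sort -rn".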
The same error pattern occurs regardless of the queried domain or upstream
server. Observed examples:
info: shut down hung fetch while resolving 0xXXXXXXXXX000(<ext-domain>/A)
debug 1: set ede: info-code 22 extra-text (null)
debug 1: client @0xXXXXXXXXX000 <client-ip>#56707 (<ext-domain>): rpz QNAME rewrite <ext-domain> stop on unrecognized qresult in rpz_rewrite() failed: SERVFAIL
debug 1: client @0xXXXXXXXXX000 <client-ip>#56707 (<ext-domain>): query failed (SERVFAIL) for <ext-domain>/IN/A at query.c:7860
debug 2: fetch completed for <ext-domain>/A in 12.000205: SERVFAIL/success [domain:.,referral:0,restart:1,qrysent:1,timeout:0,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
debug 3: client @0xXXXXXXXXX000 <client-ip>#56707 (<ext-domain>): send failed: operation canceled
query-errors: debug 1: client @0xXXXXXXXXX000 <client-ip>#56402 (<ext-domain>): rpz QNAME rewrite <ext-domain> stop on unrecognized qresult in rpz_rewrite() failed: SERVFAIL
query-errors: info: client @0xXXXXXXXXX000 <client-ip>#56402 (<ext-domain>): query failed (SERVFAIL) for <ext-domain>/IN/A at query.c:7860
query-errors: debug 2: fetch completed for <ext-domain>/A in 12.004205: SERVFAIL/success [domain:.,referral:0,restart:1,qrysent:0,timeout:0,lame:0,quota:0,neterr:0,badresp:0,adberr:0,findfail:0,valfail:0]
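
To quantify the "fetch completed" entries, a small parser along these lines
may be useful. It is only a sketch: it assumes each log entry is on a single
line and follows the format shown in the excerpt above, and the names
FETCH_RE and parse_fetch_line are invented for this example.

```python
import re

# Matches the "fetch completed" entries as they appear in the excerpt above.
FETCH_RE = re.compile(
    r"fetch completed for (?P<name>\S+)/(?P<rtype>\S+) in (?P<seconds>\d+\.\d+): "
    r"(?P<result>[^/]+)/(?P<vresult>[^\[]+?)\s*\[(?P<counters>[^\]]*)\]"
)


def parse_fetch_line(line):
    """Return the fields of one 'fetch completed' entry, or None if no match."""
    m = FETCH_RE.search(line)
    if m is None:
        return None
    counters = {}
    for item in m.group("counters").split(","):
        key, _, value = item.partition(":")
        # Most counters are integers; "domain" holds a name such as ".".
        counters[key] = int(value) if value.isdigit() else value
    return {
        "name": m.group("name"),
        "type": m.group("rtype"),
        "seconds": float(m.group("seconds")),
        "result": m.group("result"),
        "counters": counters,
    }
```

Running this over the log makes it easy to list the fetches that took about
12 seconds and to check that they report timeout:0 and neterr:0, as in the
two entries above.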
Impact:
- SERVFAIL responses are rare or absent at first, then become significantly
more frequent over time. In some cases, DNS resolution becomes unusable
- Affected queries fail after almost exactly 12 seconds
- The system accumulates thousands of stale TCP connections in CLOSE_WAIT
- The issue affects all configured DoT upstream providers simultaneously, which
makes an upstream-side cause unlikely
Downgrading to 9.20.21 fully resolves the issue.
Has anyone else seen this? Is there a configuration-level workaround that
properly closes stale TLS connections? Or is this a bug?
Thanks
Dennis