During a recent DNS outage, TCP connections from HAProxy to our DNS resolvers 
spiked. Even after DNS recovered, the connection count stayed at ~3,000 (normal 
is ~2). Although dns_process_idle_exp should clean up idle sessions, it did not 
in this state: tcpdump showed the DNS server closing idle connections after 
~30s, but HAProxy immediately reopened them. Restarting HAProxy was required to 
return to a normal connection count. Setting maxconn on the resolvers (we could 
also test pulling in this 
patch<https://github.com/haproxy/haproxy/commit/5288b39011b2449bfa896f7932c7702b5a85ee77>)
 mitigates the spike, but not the post-recovery persistence.

Environment
HAProxy version:
HAProxy version 2.9.4-9839cb-6 2024/07/31 - https://haproxy.org/
Status: stable branch - will stop receiving fixes around Q1 2025.
Known bugs: http://www.haproxy.org/bugs/bugs-2.9.4.html
Running on: Linux 5.15.173.1-2.cm2 #1 SMP Fri Feb 7 02:18:38 UTC 2025 x86_64

HAProxy Config:
defaults
    mode http
    timeout client 10s
    timeout connect 5s
    timeout server 10s
    timeout http-request 10s

resolvers systemdns
    nameserver dns1 [email protected]:53
    nameserver dns2 [email protected]:53
    parse-resolv-conf
    timeout resolve 1s

frontend one
    bind 127.0.0.1:8002
    mode http
    use_backend myservers

frontend stats
    bind ipv4@:8889,ipv6@:8889
    stats enable
    stats uri /stats
    stats refresh 10s

backend myservers
    mode http
    http-request return status 200 content-type "text/plain" string GOOD

backend T1
    balance roundrobin
    timeout connect 500ms
    option redispatch 2
    server-template t1 1-2000 my-domain.com:8446 ssl verify none resolvers systemdns init-addr none maxconn 2000

backend T2
    balance roundrobin
    timeout connect 500ms
    option redispatch 2
    server-template t2 1-2000 my-domain2.com:8446 ssl verify none resolvers systemdns init-addr none maxconn 2000

How to reproduce:

  1. Start HAProxy.
  2. Watch open TCP connections to the DNS servers:
       while true; do echo "$(date): $(ss -tanp | grep -E '(X.X.X.X|X.X.X.Y)' | wc -l) connections"; sleep 3; done
     There should only be 2 open TCP connections (1 per resolver).
  3. Inject/simulate a DNS failure:
       sudo iptables -I OUTPUT -p tcp -d X.X.X.X --dport 53 -m state --state ESTABLISHED -m length --length 52: -j DROP
     Observe the TCP connection count increase.
  4. Undo the failure:
       sudo iptables -D OUTPUT -p tcp -d X.X.X.X --dport 53 -m state --state ESTABLISHED -m length --length 52: -j DROP
     Observe that the connection count stays high.

Note: The iptables fault injection may not exactly match the original outage, 
but it produces similar symptoms. I tried both the version deployed during the 
issue and the latest, and was able to reproduce with both.

Observations
Post-recovery, HAProxy maintains ~3k DNS TCP connections.
DNS server closes idle sessions after ~30s; HAProxy immediately reopens them.
Restarting HAProxy reduces connections to normal.
Adding maxconn on the resolvers prevents the spike, but not the sticky high 
count after recovery.


Hypotheses

dns_process_idle_exp<https://github.com/haproxy/haproxy/blob/8aef5bec1ef57eac449298823843d6cc08545745/src/dns.c#L1083> target calculation

max_active_conns<https://github.com/haproxy/haproxy/blob/8aef5bec1ef57eac449298823843d6cc08545745/src/dns.c#L1116> is reset to cur_active_conns. After the outage recovers, an initial cleanup sets max_active_conns low; target<https://github.com/haproxy/haproxy/blob/8aef5bec1ef57eac449298823843d6cc08545745/src/dns.c#L1097> then becomes 0, so later idle-cleanup cycles are skipped and the excess sessions stay open.



Stuck “zombie” sessions

The failure mode, combined with our timeout/TCP/DNS settings, yields connections with onfly_queries == 0 and nb_queries > 1. When the DNS server or kernel kills the TCP connection and dns_session_release<https://github.com/haproxy/haproxy/blob/8aef5bec1ef57eac449298823843d6cc08545745/src/dns.c#L914C13-L914C32> runs, HAProxy recreates the session, preserving the bad state and keeping the connection count high.


I might be totally off here, and setting maxconn does mitigate this issue for 
us, but I thought I'd reach out to see if anyone else has any thoughts. Thanks!
