Well, I isolated the service on a load balancer with a minimal
configuration; let me detail the problem.
Two identical (cloned) Debian load balancers run the 3.16.7-ckt9
kernel, both with keepalived sharing the IP address of the proxy-tcp
service (192.168.1.100). Server A runs HAProxy 1.8.1 and server B
1.7.4; both have the configuration below (only on 1.7 is the
"master-worker" line removed):
global
    master-worker
    node balancer
    log /dev/log local1 info
    stats socket /var/run/haproxy.sock mode 0660 user testuser1 group testuser1 level admin
    pidfile /var/run/haproxy.pid
    maxconn 32000
    nbproc 1

defaults
    mode tcp
    log global
    retries 3
    option redispatch
    maxconn 100000
    fullconn 100000
    timeout connect 5s
    timeout server 50s
    timeout client 50s
    timeout http-keep-alive 60s
    default-server on-marked-down shutdown-sessions inter 5s fastinter 1s downinter 1s rise 2 fall 2

listen proxy-stats
    bind *:80
    mode http
    stats enable
    stats show-legends
    stats uri /
    stats realm Haproxy\ Statistics
    stats refresh 5s

listen proxy-tcp
    bind 192.168.1.100:8080
    option tcplog
    balance roundrobin
    option httpchk GET /test
    http-check expect string ok
    server server1 192.168.1.101:8080 check
    server server2 192.168.1.102:8080 check
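
As an aside, the health check above can be exercised by hand: with
"http-check expect string ok", a server is considered up when the
response to GET /test contains the string "ok" (the "ok" body shown
here is what my endpoint returns, just as illustration):

  $ curl -s http://192.168.1.101:8080/test
  ok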
After switching the traffic between the two servers with keepalived,
the results are:
1.7:
Session rate: 861
Sessions: 220
1.8:
Session rate: 835
Sessions: 31243
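
The figures above are read from the stats page; the same counters can
also be pulled from the admin socket declared in the global section,
for example:

  $ echo "show info" | socat stdio /var/run/haproxy.sock
  $ echo "show stat" | socat stdio /var/run/haproxy.sock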
This is the evidence on the 1.8 server:
- The ulimit-n is 64034 and Sessions Max reaches 31999 on both the
frontend and backend of the listen proxy-tcp, which matches the global
maxconn of 32000; I suppose the limit is hit as a consequence of the
issue.
- The system log reports a lot of "TCP: too many orphaned sockets" and
some lines like "net_ratelimit: 2094 callbacks suppressed".
- The HAProxy log registers a total time elapsed between the accept
and the last close equal to the 50s assigned to the server and client
timeouts.
- The termination state is normal ("--") in 99% of them. Yesterday I
said it was "sD", but today I checked and it is actually very rare;
one such line is included below only as an example.
Dec 21 15:38:09 balancer haproxy[8094]: 192.168.1.55:58674 [21/Dec/2017:15:37:19.358] proxy-tcp proxy-tcp/server2 1/0/50001 1087 -- 30211/30210/30209/15106/0 0/0
Dec 21 15:38:09 balancer haproxy[8094]: 192.168.1.42:51027 [21/Dec/2017:15:37:19.356] proxy-tcp proxy-tcp/server2 1/0/50008 5345 sD 30214/30213/30210/15106/0 0/0
Dec 21 15:38:09 balancer haproxy[8094]: 192.168.1.55:40442 [21/Dec/2017:15:37:19.364] proxy-tcp proxy-tcp/server1 1/0/50003 694 -- 30216/30215/30211/15104/0 0/0
- Of 30522 TCP sockets to the proxy-tcp address, 30160 are in
CLOSE_WAIT state on local address 192.168.1.100:8080. Yesterday I said
they were on the backend side, but I was wrong; all are on the
frontend side (counted as shown below).
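
For anyone wanting to reproduce the count, something like this should
work (ss(8) state filter; the port is the proxy-tcp bind):

  $ ss -tn state close-wait '( sport = :8080 )' | tail -n +2 | wc -l
  $ ss -s                            # summary, includes "orphaned" count
  $ sysctl net.ipv4.tcp_max_orphans  # the orphan limit mentioned above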
I have the output of "show sess all" and "show fd", but as it contains
a lot of private information, I am sending it to you through another
channel. If there is any clear evidence in there, I will take the time
to anonymize and share it.
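
In case it helps, both dumps can be captured through the admin socket
along these lines:

  $ echo "show sess all" | socat stdio /var/run/haproxy.sock > sess.txt
  $ echo "show fd" | socat stdio /var/run/haproxy.sock > fd.txt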
Thanks,
On Wed, 2017-12-20 at 18:19 +0100, Willy Tarreau wrote:
> Hello Ricardo,
>
> On Wed, Dec 20, 2017 at 05:00:33PM +0100, Ricardo Fraile wrote:
> > Hello,
> >
> > After upgrading from 1.7.4 to 1.8.1, basically with the conf
> > snippet at the end of this mail, the sessions started to grow, for
> > example:
> >
> > 1.7.4:
> > Active sessions: ~161
> > Active sessions rate: ~425
> >
> > 1.8.1:
> > Active sessions: ~6700
> > Active sessions rate: ~350
>
> Ah that's not good :-(
>
> > Looking into the Linux (3.16.7) server, there is a high number of
> > CLOSE_WAIT connections from the bind address of the listen service
> > to the backend nodes.
>
> Strange, I don't understand well what type of traffic could cause this
> except a loop, that sounds a bit unusual.
>
> > System logs reported "TCP: too many orphaned sockets", but after
> > increasing the net.ipv4.tcp_max_orphans value, the message stopped
> > but nothing changed.
>
> Normally orphans correspond to closed sockets for which there are still
> data in the system's buffers so this should be unrelated to the CLOSE_WAIT,
> unless there's a loop somewhere where a backend reconnects to the frontend,
> which can explain both situations at once when the timeout strikes.
>
> > Haproxy logs reported the "sD" indicator for that listen, but only
> > with 1.8.
>
> Thus a server timeout during the end of the transfer. That doesn't make
> much sense either.
>
> > Any ideas to dig into the issue?
>
> It would be very useful if you could share your configuration (please
> remove any sensitive info like stats passwords or IP addresses you
> prefer to keep private). When running 1.8, it would be useful to issue
> the following commands on the CLI and capture the output to a file:
> - "show sess all"
> - "show fd"
>
> Warning, the first one will reveal a lot of info (internal addresses etc)
> so you may want to send it privately and not to the list if this is the
> case (though it takes longer to diagnose it :-)).
>
> If you think you can reproduce this on a test machine out of production,
> that would be extremely useful.
>
> We have not noticed a single such issue on haproxy.org, which has
> delivered about 100 GB and 2 million requests over the last 2 weeks
> with this exact same version, so that makes me think that either the
> config or the type of traffic counts a lot toward triggering the
> problem you are observing.
>
> Regards,
> Willy