Well, I isolated the service on a load balancer server with a minimal
configuration; let me detail the problem.

Two identical (cloned) Debian load balancers with the 3.16.7-ckt9
kernel, both running keepalived and sharing the IP address of the
proxy-tcp service (192.168.1.100) over VRRP (a minimal keepalived
sketch follows the HAProxy configuration below). Server A runs HAProxy
v1.8.1 and server B v1.7.4; both have the following configuration (only
on 1.7 is the "master-worker" line removed):

global
        master-worker
        node balancer
        log /dev/log local1 info
        stats socket /var/run/haproxy.sock mode 0660 user testuser1 group testuser1 level admin
        pidfile /var/run/haproxy.pid
        maxconn 32000
        nbproc 1

defaults
        mode    tcp
        log global
        retries 3
        option redispatch

        maxconn 100000
        fullconn 100000

        timeout connect          5s
        timeout server           50s
        timeout client           50s
        timeout http-keep-alive  60s

        default-server on-marked-down shutdown-sessions inter 5s fastinter 1s downinter 1s rise 2 fall 2

listen proxy-stats
        bind *:80
        mode http
        stats enable
        stats show-legends
        stats uri     /
        stats realm   Haproxy\ Statistics
        stats refresh 5s

listen proxy-tcp
        bind 192.168.1.100:8080
        option tcplog
        balance roundrobin

        option httpchk GET /test
        http-check expect string ok

        server server1 192.168.1.101:8080 check
        server server2 192.168.1.102:8080 check
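
For reference, the VRRP side looks roughly like this minimal keepalived
sketch (the interface name, virtual_router_id and priorities are
assumptions, not the exact production values):

vrrp_instance VI_1 {
        interface eth0                  # assumed interface name
        state MASTER                    # BACKUP on the other node
        virtual_router_id 51            # assumed id
        priority 101                    # lower on the BACKUP node
        virtual_ipaddress {
                192.168.1.100           # the shared proxy-tcp address
        }
}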


After switching the traffic between the two servers with keepalived,
the results are:

1.7:
Session rate: 861
Sessions: 220

1.8:
Session rate: 835
Sessions: 31243

This is the evidence on the 1.8 server:

- The ulimit-n is 64034 and Sessions Max reaches 31999 in both the
frontend and the backend of the proxy-tcp listener (just under the
global maxconn of 32000), so I suppose the limit is reached as a
consequence of the issue.
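
As a cross-check of the effective limits, one can inspect the running
process and query the stats socket (PID 8094 is taken from the log
lines below; "show info" is the standard CLI command):

grep 'Max open files' /proc/8094/limits
echo "show info" | socat stdio unix-connect:/var/run/haproxy.sock | grep -E 'Maxconn|Maxsock|CurrConns'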

- The system log reports a lot of "TCP: too many orphaned sockets"
messages and some like "net_ratelimit: 2094 callbacks suppressed".
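
The orphan count can be compared against the sysctl ceiling with
standard tools:

grep '^TCP:' /proc/net/sockstat    # the "orphan" field
sysctl net.ipv4.tcp_max_orphans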

- The HAProxy log registers a total time elapsed between the accept and
the last close equal to the 50s assigned to the server and client
timeouts.
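
A rough way to count those entries, assuming the Tw/Tc/Tt timers are
the 10th whitespace-separated field as in the lines below, and that the
log lands in /var/log/haproxy.log (the path is an assumption):

awk '$10 ~ /\/500[0-9][0-9]$/ { n++ } END { print n }' /var/log/haproxy.log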

- The termination state is OK ("--") in 99% of them. Yesterday I said
it was "sD", but today I checked and that is very rare; I include one
such line below only as an example.

Dec 21 15:38:09 balancer haproxy[8094]: 192.168.1.55:58674 [21/Dec/2017:15:37:19.358] proxy-tcp proxy-tcp/server2 1/0/50001 1087 -- 30211/30210/30209/15106/0 0/0
Dec 21 15:38:09 balancer haproxy[8094]: 192.168.1.42:51027 [21/Dec/2017:15:37:19.356] proxy-tcp proxy-tcp/server2 1/0/50008 5345 sD 30214/30213/30210/15106/0 0/0
Dec 21 15:38:09 balancer haproxy[8094]: 192.168.1.55:40442 [21/Dec/2017:15:37:19.364] proxy-tcp proxy-tcp/server1 1/0/50003 694 -- 30216/30215/30211/15104/0 0/0

- Of the 30522 TCP sockets to the proxy-tcp address, 30160 are in
CLOSE_WAIT state on local address 192.168.1.100:8080. Yesterday I said
they were on the backend side, but I was wrong; all are on the frontend
side.
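
These counts come from ss; for example:

ss -tan state close-wait | grep -c '192.168.1.100:8080'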


I have the output of "show sess all" and "show fd", but as it contains
a lot of private information, I am sending it to you another way. If
there is any clear evidence in it, I will take the time to anonymize
and share it.
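
For the record, I captured both from the stats socket like this:

echo "show sess all" | socat stdio unix-connect:/var/run/haproxy.sock > show-sess-all.txt
echo "show fd" | socat stdio unix-connect:/var/run/haproxy.sock > show-fd.txt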


Thanks,

On Wed, 20-12-2017 at 18:19 +0100, Willy Tarreau wrote:
> Hello Ricardo,
> 
> On Wed, Dec 20, 2017 at 05:00:33PM +0100, Ricardo Fraile wrote:
> > Hello,
> > 
> > After upgrade from 1.7.4 to 1.8.1, basically with the end of mail conf
> > snippet, the sessions started to grow, as example:
> > 
> > 1.7.4:
> > Active sessions: ~161
> > Active sessions rate: ~425
> > 
> > 1.8.1:
> > Active sessions: ~6700
> > Active sessions rate: ~350
> 
> Ah that's not good :-(
> 
> > Looking into the linux (3.16.7) server, there are a high number of
> > CLOSE_WAIT connections from the bind address of the listen service to
> > the backend nodes.
> 
> Strange, I don't understand well what type of traffic could cause this
> except a loop, that sounds a bit unusual.
> 
> > System logs reported "TCP: too many orphaned sockets", but after
> > increase net.ipv4.tcp_max_orphans value, the message stops but nothing
> > changes.
> 
> Normally orphans correspond to closed sockets for which there are still
> data in the system's buffers so this should be unrelated to the CLOSE_WAIT,
> unless there's a loop somewhere where a backend reconnects to the frontend,
> which can explain both situations at once when the timeout strikes.
> 
> > Haproxy logs reported for that listen the indicator "sD", but only with
> > 1.8.
> 
> Thus a server timeout during the end of the transfer. That doesn't make
> much sense either.
> 
> > Any ideas to dig into the issue?
> 
> It would be very useful that you share your configuration (please remove
> any sensitive info like stats passwords or IP addresses you prefer to keep
> private). When running 1.8, it would be useful to issue the following
> commands on the CLI and capture the output to a file :
>   - "show sess all"
>   - "show fd"
> 
> Warning, the first one will reveal a lot of info (internal addresses etc)
> so you may want to send it privately and not to the list if this is the
> case (though it takes longer to diagnose it :-)).
> 
> If you think you can reproduce this on a test machine out of production,
> that would be extremely useful.
> 
> We have not noticed any single such issue on haproxy.org which has delivered
> about 100 GB and 2 million requests over the last 2 weeks, with this exact
> same version, so that makes me think that either the config or the type of
> traffic count a lot there to trigger the problem you are observing.
> 
> Regards,
> Willy


