Hello Ricardo,

On Wed, Dec 20, 2017 at 05:00:33PM +0100, Ricardo Fraile wrote:
> Hello,
> 
> After upgrading from 1.7.4 to 1.8.1, basically with the conf snippet at
> the end of this mail, the sessions started to grow, for example:
> 
> 1.7.4:
> Active sessions: ~161
> Active sessions rate: ~425
> 
> 1.8.1:
> Active sessions: ~6700
> Active sessions rate: ~350

Ah that's not good :-(

> Looking into the Linux (3.16.7) server, there is a high number of
> CLOSE_WAIT connections from the bind address of the listen service to
> the backend nodes.

Strange, I don't see what type of traffic could cause this except a
loop, which sounds a bit unusual.
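By the way, in case it helps to narrow it down, you can check which
backends these CLOSE_WAIT sockets point to with something like this
(the awk fields assume netstat's default column layout):

    ss -tan state close-wait | head -20
    netstat -tan | awk '$6 == "CLOSE_WAIT" {print $5}' | sort | uniq -c | sort -rn

The second command counts CLOSE_WAIT sockets per peer address, which
should quickly show whether they all concentrate on a few backends.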

> System logs reported "TCP: too many orphaned sockets", but after
> increasing the net.ipv4.tcp_max_orphans value, the message stopped and
> nothing else changed.

Normally orphans correspond to closed sockets for which there is still
data in the system's buffers, so this should be unrelated to the
CLOSE_WAIT sockets, unless there's a loop somewhere where a backend
reconnects to the frontend, which could explain both situations at once
when the timeout strikes.
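If you want to keep an eye on the orphan count while this happens, the
kernel exposes it directly (the value in the last line is only an
example, not a recommendation):

    cat /proc/sys/net/ipv4/tcp_max_orphans
    grep ^TCP: /proc/net/sockstat         # the "orphan" counter is on this line
    sysctl -w net.ipv4.tcp_max_orphans=65536   # example value only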

> HAProxy logs reported the "sD" termination state for that listen
> section, but only with 1.8.

So that's a server-side timeout ("s") striking during the data transfer
phase ("D"). That doesn't make much sense either.
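If you want to isolate those entries to share them, a simple grep works;
adjust the path to wherever your haproxy logs land, and note that in TCP
log format the field is 2 characters while in HTTP logs it is 4 (e.g.
"sD--"):

    grep ' sD ' /var/log/haproxy.log | head -20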

> Any ideas to dig into the issue?

It would be very useful if you could share your configuration (please
remove any sensitive info such as stats passwords or IP addresses you
prefer to keep private). When running 1.8, it would also be useful to
issue the following commands on the CLI and capture the output to a file:
  - "show sess all"
  - "show fd"

Warning, the first one will reveal a lot of info (internal addresses,
etc.), so you may prefer to send it privately rather than to the list if
that's a concern (though it takes longer to diagnose that way :-)).

If you think you can reproduce this on a test machine out of production,
that would be extremely useful.

We have not noticed any such issue on haproxy.org, which has delivered
about 100 GB and 2 million requests over the last 2 weeks with this exact
same version, so that makes me think that either the config or the type
of traffic plays a big role in triggering the problem you are observing.

Regards,
Willy
