Hi Sander,

On Tue, Mar 10, 2020 at 10:28:38AM +0100, Sander Klein wrote:
> Hi All,
> 
> I'm looking into a strange issue I'm having and I'm starting to think it
> is HAProxy related.
> 
> I have a setup with HAProxy serving multiple frontends and multiple
> backends, which are Nginx servers with PHP-FPM. Sometimes, all of a
> sudden, the maxconn limit is hit and connections get queued to a backend
> server, and I do not have a clue why. The backend is not overloaded, no
> traffic is flowing, Nginx/PHP-FPM picks up other connections like the
> health checks from HAProxy or our monitoring server, PHP-FPM is not doing
> anything so there are no long running processes, Nginx is doing nothing,
> but it does not receive any new connections from HAProxy. Sometimes this
> lasts for 1 second, but it also happens for as much as 30 seconds.
> 
> It does not happen on all backend servers at once, just randomly at one
> server. So if I have defined a backend with 2 servers, it happens to only
> one at a time.

There are two possible explanations for this. The first one is that
the backend server sometimes slows down for whatever reason on the
requests it is processing, resulting in maxconn being reached on haproxy.
This commonly happens on applications where one request is much more
expensive than most others. I've seen some sites use a dedicated backend
with a much lower maxconn for a search button, for example, because they
knew this search request could take multiple seconds; if enough of them
happen at the same time, the maxconn is reached and extra requests get
queued even if they could have been handled.
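
Just to illustrate the idea (the path, ACL, addresses and names below are
made up, not taken from your config), such a split typically looks like
this:

    frontend fe_main
        bind :443 ssl crt /etc/haproxy/example.pem alpn h2,http/1.1
        # send the known-expensive requests to their own backend
        acl is_search path_beg /search
        use_backend be_search if is_search
        default_backend be_app

    backend be_app
        server app1 192.0.2.10:80 check maxconn 100

    backend be_search
        # deliberately low so that a burst of expensive requests
        # queues here instead of starving the rest of the site
        server app1 192.0.2.10:80 check maxconn 5

This way the expensive requests only queue among themselves.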

The other possibility would be that some requests produce huge responses
that take a lot of time to be consumed by the clients, and until such a
response finishes being delivered there are no more slots available on
the server.

> I'm running HAProxy 2.0.13 on Debian Buster in a VM. I've tested with 'no
> option http-use-htx' and HAProxy 2.1.3 and I see the problem on both.
> Backends are Nginx with PHP-FPM and only using HTTP/1.1 over port 80, also
> VMs.
> 
> Today I disabled H2 on the frontends and now the problem seems to have
> disappeared, so it seems to be related to that part. But I'm not sure.
> How should I go on and debug this?

The best way to do it is to emit "show sess all", "show info", "show stats"
and "show fd" on the CLI when this happens. This will indicate whether
there are many connections still active to the server in trouble, and/or
whether many connections come from a single source address, for example.
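
If the CLI is not already exposed, it only takes an admin-level stats
socket in the global section (the socket path below is just an example),
and the commands can then be sent with socat when the queueing happens:

    global
        stats socket /var/run/haproxy.sock mode 600 level admin

    # then, for example:
    #   echo "show sess all" | socat stdio /var/run/haproxy.sock
    #   echo "show fd"       | socat stdio /var/run/haproxy.sock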

One thing that could happen with H2 is that it's trivial to send many
requests at once (100 by default), and if someone wants to have fun
with your site once they've found an expensive request, sending 100 of
these doesn't take more than a single TCP segment. Or if you have
heavy pages, a single H2 connection can request many objects at once,
and if the network link to the client is lossy, these can take a while
to deliver over a single connection. Since H2 is much faster than H1
on reliable networks but much worse on lossy ones, that could be an
explanation. Do you have lots of static files? If so it might make
sense to deliver them from dedicated servers that are not subject to
the very low maxconn. And if such objects are small, you could also
enable some caching to reduce the number of connections to the servers
when fetching them.
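
For such small objects, the cache only needs a cache section plus two
directives in the backend; the names, addresses and sizes below are
purely illustrative:

    cache static_cache
        total-max-size 64        # total cache size, in megabytes
        max-object-size 102400   # only cache objects up to 100 kB
        max-age 60               # keep entries for 60 seconds

    backend be_static
        http-request  cache-use   static_cache
        http-response cache-store static_cache
        server static1 192.0.2.20:80 check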

> The config looks a bit like this (very redacted and very, very much
> shortened):
(...)

Looks pretty fine at first glance. I'm seeing "prefer-last-server", which
*might* contribute to the problem if it's caused by sudden spikes, as it
encourages requests from the same client to go to the same server, but
that's not necessarily the case.

Willy
