Hi Alexander!

On Fri, Aug 06, 2010 at 09:09:56PM +0200, Alexander Staubo wrote:
> [Apologies if this reaches the list twice. I sent it approx 10 hours
> ago, and it hasn't appeared yet, probably because I used the wrong
> sender adress.]

In fact it got there, strange.

> We are seeing some requests taking a while before being able to get a
> connection through to HAProxy. Using tcpdump we are seeing cases where
> the clients needs 9 SYN packets before HAProxy responds to the
> connect. Other services on the same box do not suffer the same
> problem, so it's definitely HAProxy being overloaded. Is there
> anything we can tune to improve the situation?

When you observe this, it means that the SYN backlog queue is too short.
Haproxy itself does not respond to SYN packets. It is the system which
responds to a SYN with a SYN-ACK and then, when it gets the final ACK from
the client, wakes haproxy up.
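
To illustrate the point that the system, not the application, completes the
handshake, here is a minimal Python sketch (not haproxy code). It opens a
listening socket on loopback and never calls accept(), yet the client's
connect() still completes. The function name and the loopback setup are only
there for the illustration.

```python
import socket

def connect_without_accept(backlog=5):
    """Return True if connect() completes although the server never calls accept()."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))      # port 0: let the system pick a free port
    server.listen(backlog)             # from here on the kernel answers SYNs itself
    port = server.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2.0)
    try:
        # The three-way handshake is done by the kernel; the connection then
        # waits in the accept queue until the application picks it up.
        client.connect(("127.0.0.1", port))
        return True
    except OSError:
        return False
    finally:
        client.close()
        server.close()

print("handshake completed without accept():", connect_without_accept())
```

When the queue behind listen() is full, the system has nowhere to put new
connections, and the client ends up retransmitting its SYN.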

The size of the backlog is determined by the MIN() of
net.ipv4.tcp_max_syn_backlog, net.core.somaxconn and the parameter passed
by the application to the listen() call (here the application being haproxy).
By default, haproxy sets the backlog size to the same value as the frontend's
maxconn, but you can change the value using the "backlog" parameter in your
frontend. However, unless that value is already extremely low, there is little
chance that it will change anything.
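
The MIN() rule above can be written down as a small Python sketch. The helper
names are made up for the illustration, and read_sysctl() assumes a Linux
/proc filesystem. The maxconn of 2000 in the example is a hypothetical value,
since the actual haproxy configuration was not posted.

```python
def read_sysctl(name):
    """Read an integer sysctl such as 'net.core.somaxconn' from /proc/sys (Linux only)."""
    path = "/proc/sys/" + name.replace(".", "/")
    with open(path) as f:
        return int(f.read().split()[0])

def effective_backlog(listen_backlog, somaxconn, tcp_max_syn_backlog):
    """Apply the MIN() rule: the smallest of the three limits wins."""
    return min(listen_backlog, somaxconn, tcp_max_syn_backlog)

# Sysctl values taken from the quoted config; 2000 is a hypothetical
# frontend maxconn (haproxy passes it to listen() unless "backlog" is set).
print(effective_backlog(2000, 262144, 262144))  # -> 2000
```

With sysctls that large, the value passed to listen() is the one that ends up
limiting the queue.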

> Here are the network-related parts of our sysctl config:
> 
> net.core.rmem_max=16777216
> net.core.wmem_max=16777216
> net.ipv4.tcp_rmem=4096 87380 16777216
> net.ipv4.tcp_wmem=4096 65536 16777216
> net.core.netdev_max_backlog=15000
> net.ipv4.tcp_max_tw_buckets = 16777216
> net.core.somaxconn = 262144
> net.ipv4.tcp_tw_recycle = 1
> net.ipv4.tcp_max_syn_backlog = 262144

Your settings are good for loads of up to a few thousand requests/s.

> HAProxy is serving about ~300 req/s on the box.

So there is something else happening (unless your maxconn or backlog is
too low, of course).

> The processor load is not very high (~3.3 among four cores), and we don't
> see any obvioius bottlenecks. However, it is running Varnish as well.

What else is running on the machine? It does not seem possible to have
that high a load with so little traffic! Even my 5-year-old notebook
does not report any CPU usage at that load :-/

What I suspect is that you're running something heavily multi-threaded
or multi-processed that eats all the CPU, so that haproxy only gets a small
share once in a while. I've seen this happen with the old RHEL3 scheduler,
and with the old 2.6 one as well before it was replaced by CFS in 2.6.23. In
the worst cases, when global load was getting close to 50% total CPU usage,
it was even possible to see some tasks not get the CPU for more than 30
seconds!

> HAProxy 1.3.15.7.

I just checked the known bugs for this version, and none seems related to
what you describe. Also, it has been running fine for more than a year on
an infrastructure where it handled about twice that load.

Just a silly question: do you sometimes observe that the affected frontend
is marked as "FULL"? Or does your stats page report the "max" value sometimes
reaching the "limit" one in the "Sessions" column? Maybe we're seeing
occasional delays in response times which cause requests to accumulate
to the point that the backlog fills up. In that case, increasing
the backlog in order to absorb the pending requests during those global
delays could help.
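
If you prefer to check this from a script rather than by eye, here is a rough
Python sketch. It assumes your version can export the stats page as CSV and
that the export carries the pxname, svname, smax and slim columns; please
verify both on your side. The sample data below is made up.

```python
import csv
import io

def saturated_frontends(stats_csv):
    """Return (name, max, limit) for each frontend whose session max reached its limit."""
    text = stats_csv.lstrip()
    if text.startswith("# "):          # the header line is prefixed with "# "
        text = text[2:]
    hits = []
    for row in csv.DictReader(io.StringIO(text)):
        if row.get("svname") != "FRONTEND":
            continue
        smax, slim = row.get("smax") or "", row.get("slim") or ""
        if smax.isdigit() and slim.isdigit() and int(slim) > 0:
            if int(smax) >= int(slim):
                hits.append((row["pxname"], int(smax), int(slim)))
    return hits

# Made-up sample, reduced to the columns used here.
SAMPLE = """# pxname,svname,scur,smax,slim
www,FRONTEND,120,2000,2000
www,BACKEND,80,900,2000
admin,FRONTEND,1,3,100
"""

print(saturated_frontends(SAMPLE))  # -> [('www', 2000, 2000)]
```

Any frontend listed there has hit its limit at least once since the counters
were last reset.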

Regards,
Willy

