Hi Alexander!

On Fri, Aug 06, 2010 at 09:09:56PM +0200, Alexander Staubo wrote:
> [Apologies if this reaches the list twice. I sent it approx 10 hours
> ago, and it hasn't appeared yet, probably because I used the wrong
> sender adress.]

In fact it got there, strange.

> We are seeing some requests taking a while before being able to get a
> connection through to HAProxy. Using tcpdump we are seeing cases where
> the clients needs 9 SYN packets before HAProxy responds to the
> connect. Other services on the same box do not suffer the same
> problem, so it's definitely HAProxy being overloaded. Is there
> anything we can tune to improve the situation?

When you observe this, it means that the SYN backlog queue is too
short. Haproxy itself does not respond to SYN packets: the system
responds to a SYN with a SYN-ACK, and when it gets the final ACK from
the client, it wakes haproxy up. The size of the backlog is determined
by the MIN() of net.ipv4.tcp_max_syn_backlog, net.core.somaxconn and
the parameter passed by the application to the listen() call (here the
application being haproxy). By default, haproxy sets the backlog size
to the same value as the frontend's maxconn, but you can change it
using the "backlog" parameter in your frontend. Unless that value is
already extremely low, though, there is little chance that changing it
will make any difference.

> Here are the network-related parts of our sysctl config:
>
> net.core.rmem_max=16777216
> net.core.wmem_max=16777216
> net.ipv4.tcp_rmem=4096 87380 16777216
> net.ipv4.tcp_wmem=4096 65536 16777216
> net.core.netdev_max_backlog=15000
> net.ipv4.tcp_max_tw_buckets = 16777216
> net.core.somaxconn = 262144
> net.ipv4.tcp_tw_recycle = 1
> net.ipv4.tcp_max_syn_backlog = 262144

Your settings are good for loads up to around a few thousand
requests/s.

> HAProxy is serving about ~300 req/s on the box.

So there is something else happening (unless your maxconn or backlog
is too low, of course).

> The processor load is not very high (~3.3 among four cores), and we don't
> see any obvioius bottlenecks. However, it is running Varnish as well.

What else is running on the machine? It does not seem possible to have
that high a load with that little traffic!
Even my 5-year-old notebook does not report any CPU usage at that
load :-/

What I'm suspecting is that you're running something heavily
multi-threaded or multi-processed that eats all the CPU, and that
haproxy only gets a small share once in a while. I've seen this happen
with the old RHEL3 scheduler, and with the old 2.6 one as well before
it was replaced by CFS in 2.6.23. The worst cases were when global load
was getting close to 50% total CPU usage: it was even possible to see
some tasks not get the CPU for more than 30 seconds!

> HAProxy 1.3.15.7.

I just checked the known bugs for this version, and none seems related
to what you describe. Also, it's been running fine for more than one
year on an infrastructure where it took about twice that load.

Just a silly question: do you sometimes observe that the affected
frontend is marked as "FULL"? Or does your stats page sometimes report
the "max" value reaching the "limit" one in the "Sessions" column?
Maybe we're still encountering occasional delays in response times
which cause requests to accumulate to the point that the backlog fills
up. In that case, increasing the backlog in order to absorb the pending
requests during those global delays could help.

Regards,
Willy
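
To make the MIN() rule described above concrete, here is a rough Python
sketch, assuming a Linux box where the sysctls are readable under
/proc/sys. The helper names (read_sysctl, effective_backlog) and the
maxconn value of 2000 are made up for illustration and are not part of
haproxy. Exact kernel behaviour varies between versions, so treat the
result as an estimate only.

```python
from pathlib import Path


def read_sysctl(name, default):
    """Read an integer sysctl from /proc/sys; fall back to default if unreadable."""
    path = Path("/proc/sys") / name.replace(".", "/")
    try:
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return default


def effective_backlog(listen_backlog, somaxconn, tcp_max_syn_backlog):
    """Apply the MIN() rule from the message: the smallest of the three wins."""
    return min(listen_backlog, somaxconn, tcp_max_syn_backlog)


if __name__ == "__main__":
    # Hypothetical frontend maxconn; haproxy passes this to listen() by
    # default unless "backlog" is set in the frontend.
    frontend_maxconn = 2000
    somaxconn = read_sysctl("net.core.somaxconn", frontend_maxconn)
    syn_backlog = read_sysctl("net.ipv4.tcp_max_syn_backlog", frontend_maxconn)
    print("effective backlog:",
          effective_backlog(frontend_maxconn, somaxconn, syn_backlog))
```

With the sysctl values quoted above (262144 for both), the listen()
argument is the smallest of the three, which is why maxconn or
"backlog" is the setting worth checking first.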

