On Mon, Aug 16, 2010 at 12:27 AM, Willy Tarreau <[email protected]> wrote:
> On Mon, Aug 16, 2010 at 12:02:47AM +0200, Alexander Staubo wrote:
>> On Wed, Aug 11, 2010 at 11:22 PM, Willy Tarreau <[email protected]> wrote:
>> >> We are seeing some requests taking a while before being able to get a
>> >> connection through to HAProxy. Using tcpdump we are seeing cases where
>> >> the clients needs 9 SYN packets before HAProxy responds to the
>> >> connect. Other services on the same box do not suffer the same
>> >> problem, so it's definitely HAProxy being overloaded. Is there
>> >> anything we can tune to improve the situation?
>> >
>> > When you observe this, it means that the SYN backlog queue is too short.
>> > Haproxy itself does not respond to SYN packets, it's the system which
>> > responds to SYN with a SYN-ACK, then when it gets the final ACK from the
>> > client, it wakes haproxy up.
>> [...]
>>
>> After a bit of debugging it seems that the problem is not on the
>> server end at all -- but with the ISP. :-/
>
> Ouch... smells like an overloaded firewall then...
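As an aside, for anyone who lands on this thread while chasing the same symptom: the kernel-side queues Willy describes can be inspected and enlarged with standard tools. A sketch (the sysctl names are the usual Linux ones; the values are purely illustrative, and haproxy must also request a large backlog itself):

```shell
# Check whether the kernel is dropping SYNs or overflowing listen queues
netstat -s | grep -i -E 'listen|syn'

# On LISTEN sockets, ss shows the current accept-queue depth vs its limit
ss -lnt

# Enlarge the SYN backlog and the accept backlog cap (illustrative values)
sysctl -w net.ipv4.tcp_max_syn_backlog=16384
sysctl -w net.core.somaxconn=16384
```

Note that somaxconn only raises the ceiling; the listening application still has to pass a matching backlog to listen().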
Never mind! We have been able to reproduce the problem elsewhere, so it's
not ISP-related after all. In fact, it seems to be this sysctl setting:

# Allow to reuse TIME-WAIT sockets for new connections when it is safe
# from protocol viewpoint. The default value is 0
net.ipv4.tcp_tw_recycle = 1

It turns out it has caused problems with Varnish on NATed subnets (such as
ours):

http://www.mail-archive.com/[email protected]/msg02899.html

Relevant quote:

> After troubleshooting with the website owner, tcpdumping at various
> points on both sides, it was clear that the packets were reaching the
> varnish node, but except the last SYN, they were all dropped. This
> turned out to be because the varnish node had the tcp_tw_recycle sysctl
> enabled. Switching it off fixed the problem.

The problem has persisted even after we moved Varnish away from the
HAProxy box, so it's apparent that the socket recycling is causing
problems with HAProxy as well, not just Varnish. We have now turned the
option off temporarily to see if that helps; it's only been half an hour,
but no problems yet, so I'm optimistic. I will let you know when we have
confirmed that the problem has definitely been solved.

Apparently there's a safer option that serves an almost identical purpose,
net.ipv4.tcp_tw_reuse. We are going to try enabling that one later.
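For reference, here is roughly what we are changing, as a sysctl.conf fragment. The comments reflect my current understanding of why recycling misbehaves behind NAT, not authoritative documentation:

```shell
# /etc/sysctl.conf fragment
#
# tcp_tw_recycle validates TCP timestamps per source IP. Behind NAT, many
# clients share one IP but have independent timestamp clocks, so their
# SYNs can look "old" and get silently dropped -- matching what we saw.
net.ipv4.tcp_tw_recycle = 0

# The safer variant: reuse a TIME-WAIT socket only for outgoing
# connections, where it is provably safe from the protocol viewpoint.
net.ipv4.tcp_tw_reuse = 1
```

Apply with `sysctl -p` (or reboot) after editing the file.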

