Hi,

Chris Allen and Jeff Zellner reported a similar issue at the same time on two different versions : 1.4.20 and 1.5-dev17. The symptom is always the same : haproxy suddenly started to crash under load whereas it did not in the past.

When looking deeper into the traces and core files, it turned out that both versions were built with TARGET=generic, so haproxy was using select() to poll for new events. The issue was tracked down to a recent update to glibc which now verifies that the file descriptor number passed to FD_SET/FD_CLR/FD_ISSET lies between 0 and FD_SETSIZE-1 (1023) :

    http://repo.or.cz/w/glibc.git/commitdiff/a0f33f996

I believe it was merged into glibc 2.16 and backported to the glibc 2.15 shipped with Ubuntu 12.04.

This introduces a regression on something which has worked for ages and was not clearly documented in the past. At least on Linux, Solaris, FreeBSD and OpenBSD, I've been used to successfully calling select() with FD sets which were arrays of fd_set[] in order to support more than FD_SETSIZE fds. I've been doing this for 18 years without issues, and older man pages did not suggest this was not the expected way to use them. Now we get some warnings in updated Linux man pages :

    An fd_set is a fixed size buffer. Executing FD_CLR or FD_SET with
    a value of fd that is negative or is equal to or larger than
    FD_SETSIZE will result in undefined behavior. Moreover, POSIX
    requires fd to be a valid file descriptor.

Some implementations (eg: FreeBSD) explicitly suggest that FD_SETSIZE may be redefined to work around the limitation. POSIX says select() returns EINVAL when nfds is greater than FD_SETSIZE, but select() being a function, it has no way to learn the limit if FD_SETSIZE is redefined...

So now at least, what glibc does is to enforce what is documented, even if the code below it works flawlessly. I'm seeing that haproxy is not the only program to be affected by this, as I've found bug reports for the same crash in qemu, freerdp and httperf (which is how I found out about the glibc commit).

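As a rough illustration of the range check described above, and of one way to track more than FD_SETSIZE descriptors without going through the FD_* macros, here is a minimal sketch. It is not taken from haproxy's sources : the names fd_bitmap, fd_bitmap_set(), fd_fits_fd_set() and so on are made up for the example, and a bitmap like this is only for a program's own bookkeeping, it is not something to hand to select().

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <sys/select.h>   /* only needed for FD_SETSIZE */

/* All names below are hypothetical, chosen for this example only. */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* A plain bit field sized at run time, independent of fd_set. */
struct fd_bitmap {
    unsigned long *words;
    size_t nbits;          /* number of fds the bitmap can hold */
};

/* Allocate a zeroed bitmap able to hold fds 0..maxfds-1. Returns 0 or -1. */
static int fd_bitmap_init(struct fd_bitmap *m, size_t maxfds)
{
    size_t nwords = (maxfds + BITS_PER_WORD - 1) / BITS_PER_WORD;

    m->words = calloc(nwords ? nwords : 1, sizeof(*m->words));
    m->nbits = m->words ? maxfds : 0;
    return m->words ? 0 : -1;
}

static void fd_bitmap_free(struct fd_bitmap *m)
{
    free(m->words);
    m->words = NULL;
    m->nbits = 0;
}

/* Equivalent of FD_SET, but bounded by the bitmap size, not FD_SETSIZE. */
static void fd_bitmap_set(struct fd_bitmap *m, int fd)
{
    assert(fd >= 0 && (size_t)fd < m->nbits);
    m->words[(size_t)fd / BITS_PER_WORD] |= 1UL << ((size_t)fd % BITS_PER_WORD);
}

/* Equivalent of FD_CLR. */
static void fd_bitmap_clr(struct fd_bitmap *m, int fd)
{
    assert(fd >= 0 && (size_t)fd < m->nbits);
    m->words[(size_t)fd / BITS_PER_WORD] &= ~(1UL << ((size_t)fd % BITS_PER_WORD));
}

/* Equivalent of FD_ISSET. Returns 1 if the bit is set, 0 otherwise. */
static int fd_bitmap_isset(const struct fd_bitmap *m, int fd)
{
    assert(fd >= 0 && (size_t)fd < m->nbits);
    return (m->words[(size_t)fd / BITS_PER_WORD] >> ((size_t)fd % BITS_PER_WORD)) & 1UL;
}

/* The range that the man page documents as valid for FD_SET/FD_CLR/FD_ISSET. */
static int fd_fits_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}
```

A caller could test fd_fits_fd_set(fd) before touching FD_SET, and either refuse the fd or fall back to its own bit field when the test fails, instead of finding out through a crash under load.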
It happens that historically, for its poll() implementation, haproxy reused the fd_sets in order to limit the changes from select(), and at the moment, its poll() implementation still relies on the FD_* macros. So on Linux, both select() and poll() are affected by this change.

What needs to be done if you're using haproxy with a recent Linux system :

  - run "haproxy -vv" to ensure you're using epoll or sepoll, instead of poll or select :

      $ ./haproxy -vv
      HA-Proxy version 1.5-dev17-85 2013/03/25
      Copyright 2000-2012 Willy Tarreau <w...@1wt.eu>

      Build options :
        TARGET  = linux26        <============ not "generic" here
        CPU     = generic
        CC      = gcc
        CFLAGS  = -O2 -g -fno-strict-aliasing
        OPTIONS = (...)

      Available polling systems :
            epoll : pref=300,  test result OK
             poll : pref=200,  test result OK
           select : pref=150,  test result OK
      Total: 3 (3 usable), will use epoll.
                           ^^^^^^^^^^^^^^^ this one is OK

  - if you have an old build which uses a generic target and poll or select, either rebuild it if you feel confident doing so, or make sure you have an old glibc and are not affected ;

  - check if your glibc is >= 2.15 :

      $ ls -la /lib*/libc-*.so*

    If you have an older version, you're safe.

For the medium term, I'm going to prepare the following changes :

  - make poll() rely solely on bit fields without using FD_* macros

  - add a startup warning when select() is used with a maxconn leading to more than FD_SETSIZE fds, followed by a runtime test to make it crash in glibc while parsing the config if needed, instead of reserving a Friday evening surprise for you

  - enable poll() by default in the generic target, as it's supported on all platforms where haproxy is known to build

I'm open to any other suggestions you may have.

Regards,
Willy