Hi Willy,

Thanks a lot for this investigation; it was really helpful.
My OpenSSL is up to date on this server.

I first tried removing the chroot statement. I'm pretty sure this in itself solved the leak, but I no longer have the traces, and a couple of hours later our Ops changed the SSL check to a simple TCP check on port 443. So I cannot confirm it 100%. I can, however, confirm that I no longer experience the leak. I put the chroot statement back to be safer.

This also prompted me to tweak the SSL ciphers. I now use a more carefully chosen list of ciphers (https://mozilla.github.io/server-side-tls/ssl-config-generator/) and have disabled SSLv3. This indeed disables KRB5.

ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA
ssl-default-bind-options no-sslv3

I will keep a close eye on the memory usage... HAProxy has been running for about 16 hours now, and here is the ps output:

# ps -u nobody u
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nobody   63985  0.5  0.0  53868 10960 ?        Ss   Feb02   5:19 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid

Looks good :-)

--
Georges-Etienne

On Mon, Feb 2, 2015 at 10:25 AM, Willy Tarreau <w...@1wt.eu> wrote:
> Georges-Etienne,
>
> your captures were extremely informative.
> While I cannot reproduce the
> behaviour here even by reinjecting the same health check requests, I'm
> seeing two really odd things in your trace below :
>
> We accept an SSL connection from the firewall :
>
> 08:15:52.297357 accept(6, {sa_family=AF_INET, sin_port=htons(32764),
> sin_addr=inet_addr("<firewall>")}, [16]) = 1
>
> It sends 48 bytes :
>
> 08:15:52.297717 read(1, "\200.\1\3\0\0\25\0\0\0\20", 11) = 11
> 08:15:52.297831 read(1,
> "\0\0\3\0\0\10\0\0\6\4\0\200\0\0\4\0\0\5O\0\0@\202J#i\242K7)\300\2536o\245=\23",
> 37) = 37
>
> Then we're checking for /etc/krb5.conf :
>
> 08:15:52.297984 stat("/etc/krb5.conf", 0x7fff544b1990) = -1 ENOENT (No
> such file or directory)
>
> Then trying to read some random :
>
> 08:15:52.298082 open("/dev/urandom", O_RDONLY) = -1 ENOENT (No such file
> or directory)
>
> Then trying to figure the local host name :
>
> 08:15:52.298187 uname({sys="Linux", node="<node's local hostname>", ...})
> = 0
>
> Then doing some netlink-based stuff :
>
> 08:15:52.298316 socket(PF_NETLINK, SOCK_RAW, 0) = 2
> 08:15:52.298395 bind(2, {sa_family=AF_NETLINK, pid=0, groups=00000000},
> 12) = 0
> 08:15:52.298471 getsockname(2, {sa_family=AF_NETLINK, pid=9103,
> groups=00000000}, [12]) = 0
> 08:15:52.298550 sendto(2, "\24\0\0\0\26\0\1\3\210x\317T\0\0\0\0\0\0\0\0",
> 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
> 08:15:52.298650 recvmsg(2, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
> groups=00000000},
> msg_iov(1)=[{"0\0\0\0\24\0\2\0\210x\317T\217#\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1\10\0\2\0\177\0\0\1\7\0\3\0lo\0\0<\0\0\0\24\0\2\0\210x\317T\217#\0\0\2\32\200\0\n\0\0\0\10\0\1\0\n\0\35\22\10\0\2\0\n\0\35\22\10\0\4\0\n\0\35?\n\0\3\0bond0\0\0\0<\0\0\0\24\0\2\0\210x\317T\217#\0\0\2\32\200\0\f\0\0\0\10\0\1\0\n\2\177\217\10\0\2\0\n\2\177\217\10\0\4\0\n\2\177\277\n\0\3\0bond2\0\0\0<\0\0\0\24\0\2\0\210x\317T\217#\0\0\2\32\200\0\r\0\0\0\10\0\1\0\nZ\6j"...,
> 4096}], msg_controllen=0, msg_flags=0}, 0) = 356
> 08:15:52.298841
> recvmsg(2, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
> groups=00000000},
> msg_iov(1)=[{"@\0\0\0\24\0\2\0\210x\317T\217#\0\0\n\200\200\376\1\0\0\0\24\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\24\0\6\0\377\377\377\377\377\377\377\377H\3\0\0H\3\0\0@
> \0\0\0\24\0\2\0\210x\317T\217#\0\0\n@
> \200\375\n\0\0\0\24\0\1\0\376\200\0\0\0\0\0\0\3064k\377\376\256\37@
> \24\0\6\0\377\377\377\377\377\377\377\377f\5\0\0f\5\0\0@
> \0\0\0\24\0\2\0\210x\317T\217#\0\0\n@
> \200\375\v\0\0\0\24\0\1\0\376\200\0\0\0\0\0\0\3064k\377\376\256\37A\24\0\6\0\377\377\377\377\377\377\377\377\232\5\0\0\232\5\0\0@\0\0\0\24\0\2\0"...,
> 4096}], msg_controllen=0, msg_flags=0}, 0) = 448
> 08:15:52.299059 recvmsg(2, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
> groups=00000000},
> msg_iov(1)=[{"\24\0\0\0\3\0\2\0\210x\317T\217#\0\0\0\0\0\0\1\0\0\0\24\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\24\0\6\0\377\377\377\377\377\377\377\377H\3\0\0H\3\0\0@
> \0\0\0\24\0\2\0\210x\317T\217#\0\0\n@
> \200\375\n\0\0\0\24\0\1\0\376\200\0\0\0\0\0\0\3064k\377\376\256\37@
> \24\0\6\0\377\377\377\377\377\377\377\377f\5\0\0f\5\0\0@
> \0\0\0\24\0\2\0\210x\317T\217#\0\0\n@
> \200\375\v\0\0\0\24\0\1\0\376\200\0\0\0\0\0\0\3064k\377\376\256\37A\24\0\6\0\377\377\377\377\377\377\377\377\232\5\0\0\232\5\0\0@\0\0\0\24\0\2\0"...,
> 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
> 08:15:52.299242 close(2) = 0
>
> Then trying to open nsswitch.conf :
>
> 08:15:52.299353 open("/etc/nsswitch.conf", O_RDONLY) = -1 ENOENT (No such
> file or directory)
>
> Then does the netlink + nsswitch dance a second time, followed by about
> 10 times the following with various domain name suffixes :
>
> 08:15:52.300841 open("/etc/resolv.conf", O_RDONLY) = -1 ENOENT (No such
> file or directory)
> 08:15:52.300938 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 2
> 08:15:52.301018 connect(2, {sa_family=AF_INET, sin_port=htons(53),
> sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> 08:15:52.301100 poll([{fd=2, events=POLLOUT}], 1, 0) = 1 ([{fd=2,
> revents=POLLOUT}])
> 08:15:52.301179 sendto(2, "\327\r\1\0\0\1\0\0\0\0\0\0\t_kerberos\f<various
> domain suffixes>\0\0\20\0\1", 51, MSG_NOSIGNAL, NULL, 0) = 51
> 08:15:52.301296 poll([{fd=2, events=POLLIN}], 1, 5000) = 1 ([{fd=2,
> revents=POLLERR}])
> 08:15:52.301373 close(2) = 0
>
> Etc. It does that *a lot*. A few times we're seeing brk() with an
> increasing value though it's not huge enough to prove everything leaks
> there, but it proves that it happens inside openssl, since it's between
> a read() performed by openssl and a stat() performed by it as well :
>
> 08:16:02.055371 epoll_wait(0, {{EPOLLIN, {u32=6, u64=6}}, {EPOLLIN,
> {u32=5, u64=5}}}, 200, 159) = 2
> 08:16:02.055457 accept(6, {sa_family=AF_INET, sin_port=htons(13053),
> sin_addr=inet_addr(<some-public-address>)}, [16]) = 2
> 08:16:02.055550 fcntl(2, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
> 08:16:02.055658 accept(6, 0x7fff544b3e70, [128]) = -1 EAGAIN (Resource
> temporarily unavailable)
> 08:16:02.055806 read(2, "\200.\1\3\0\0\25\0\0\0\20", 11) = 11
> 08:16:02.055908 read(2,
> "\0\0\3\0\0\10\0\0\6\4\0\200\0\0\4\0\0\5\233G\314"..., 37) = 37
> 08:16:02.056005 brk(0x5062d000) = 0x5062d000
> 08:16:02.056134 stat("/etc/krb5.conf", 0x7fff544b1990) = -1 ENOENT (No
> such file or directory)
> 08:16:02.056228 open("/dev/urandom", O_RDONLY) = -1 ENOENT (No such file
> or directory)
>
> OpenSSL sometimes acts stupidly like this inside a chroot. We've
> encountered a few issues in the past with openssl doing totally crazy
> stuff inside a chroot, including abort() on krb5-related things. From
> what I understood (others, please correct me if I'm wrong), such
> processing may be altered by the type of key or ciphers.
>
> In my opinion, you should attempt three things :
>
> 1) ensure that your ssl library is up to date (double checking doesn't
> cost much)
>
> 2) try it again without the chroot statement to see if when openssl finds
> what it's looking for, the leak stops.
>
> 3) maybe file a report to the openssl list about a memory leak in that
> exact situation, with the traces you sent to me. Maybe they'll want
> to have your public key as well to verify some assumptions about
> what could be done inside the lib with its properties.
>
> Would you be able to simply stop the firewall's incoming checks on port
> 443 to confirm it's enough to stop the leak ? Another option might
> consist in starting two distinct haproxy processes, one for 80 and
> another one for 443.
>
> At this point, I guess I'm running out of ideas :-/
>
> Best regards,
> Willy