Hi Willy,

Thanks a lot for this investigation, it was really helpful.

My OpenSSL is up-to-date on this server. I first tried to remove the chroot
statement. I'm pretty sure this in itself solved the leak, but I no longer
have the traces and couple of hours after, our Ops changed the SSL check to
a simple TCP check on port 443. So, I cannot confirm 100%.

I can however confirm that I no longer experience the leak. I put back the
chroot command to be safer.

This also prompted me to tweak the SSL ciphers. I now use a more thoughtful
list of ciphers (
https://mozilla.github.io/server-side-tls/ssl-config-generator/) and
disabled SSLv3. This indeed disables KRB5.

    ssl-default-bind-ciphers
ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA
    ssl-default-bind-options no-sslv3

I will keep a close eye on the memory usage... HAproxy has been running for
about 16 hours now, and here is the ps output:
# ps -u nobody u
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nobody   63985  0.5  0.0  53868 10960 ?        Ss   Feb02   5:19
/usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid

Looks good :-)

-- Georges-Etienne

On Mon, Feb 2, 2015 at 10:25 AM, Willy Tarreau <w...@1wt.eu> wrote:

> Georges-Etienne,
>
> your captures were extremely informative. While I cannot reproduce the
> behaviour here even by reinjecting the same health check requests, I'm
> seeing two really odd things in your trace below :
>
> We accept an SSL connection from the firewall :
>
> 08:15:52.297357 accept(6, {sa_family=AF_INET, sin_port=htons(32764),
> sin_addr=inet_addr("<firewall>")}, [16]) = 1
>
> It sends 48 bytes :
>
> 08:15:52.297717 read(1, "\200.\1\3\0\0\25\0\0\0\20", 11) = 11
> 08:15:52.297831 read(1, 
> "\0\0\3\0\0\10\0\0\6\4\0\200\0\0\4\0\0\5O\0\0@\202J#i\242K7)\300\2536o\245=\23",
> 37) = 37
>
> Then we're checking for /etc/krb5.conf :
>
> 08:15:52.297984 stat("/etc/krb5.conf", 0x7fff544b1990) = -1 ENOENT (No
> such file or directory)
>
> Then trying to read some random :
>
> 08:15:52.298082 open("/dev/urandom", O_RDONLY) = -1 ENOENT (No such file
> or directory)
>
> Then trying to figure the local host name :
>
> 08:15:52.298187 uname({sys="Linux", node="<node's local hostname>", ...})
> = 0
>
> Then doing some netlink-based studd :
>
> 08:15:52.298316 socket(PF_NETLINK, SOCK_RAW, 0) = 2
> 08:15:52.298395 bind(2, {sa_family=AF_NETLINK, pid=0, groups=00000000},
> 12) = 0
> 08:15:52.298471 getsockname(2, {sa_family=AF_NETLINK, pid=9103,
> groups=00000000}, [12]) = 0
> 08:15:52.298550 sendto(2, "\24\0\0\0\26\0\1\3\210x\317T\0\0\0\0\0\0\0\0",
> 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
> 08:15:52.298650 recvmsg(2, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
> groups=00000000},
> msg_iov(1)=[{"0\0\0\0\24\0\2\0\210x\317T\217#\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1\10\0\2\0\177\0\0\1\7\0\3\0lo\0\0<\0\0\0\24\0\2\0\210x\317T\217#\0\0\2\32\200\0\n\0\0\0\10\0\1\0\n\0\35\22\10\0\2\0\n\0\35\22\10\0\4\0\n\0\35?\n\0\3\0bond0\0\0\0<\0\0\0\24\0\2\0\210x\317T\217#\0\0\2\32\200\0\f\0\0\0\10\0\1\0\n\2\177\217\10\0\2\0\n\2\177\217\10\0\4\0\n\2\177\277\n\0\3\0bond2\0\0\0<\0\0\0\24\0\2\0\210x\317T\217#\0\0\2\32\200\0\r\0\0\0\10\0\1\0\nZ\6j"...,
> 4096}], msg_controllen=0, msg_flags=0}, 0) = 356
> 08:15:52.298841 recvmsg(2, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
> groups=00000000},
> msg_iov(1)=[{"@\0\0\0\24\0\2\0\210x\317T\217#\0\0\n\200\200\376\1\0\0\0\24\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\24\0\6\0\377\377\377\377\377\377\377\377H\3\0\0H\3\0\0@
> \0\0\0\24\0\2\0\210x\317T\217#\0\0\n@
> \200\375\n\0\0\0\24\0\1\0\376\200\0\0\0\0\0\0\3064k\377\376\256\37@
> \24\0\6\0\377\377\377\377\377\377\377\377f\5\0\0f\5\0\0@
> \0\0\0\24\0\2\0\210x\317T\217#\0\0\n@
> \200\375\v\0\0\0\24\0\1\0\376\200\0\0\0\0\0\0\3064k\377\376\256\37A\24\0\6\0\377\377\377\377\377\377\377\377\232\5\0\0\232\5\0\0@\0\0\0\24\0\2\0"...,
> 4096}], msg_controllen=0, msg_flags=0}, 0) = 448
> 08:15:52.299059 recvmsg(2, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
> groups=00000000},
> msg_iov(1)=[{"\24\0\0\0\3\0\2\0\210x\317T\217#\0\0\0\0\0\0\1\0\0\0\24\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\24\0\6\0\377\377\377\377\377\377\377\377H\3\0\0H\3\0\0@
> \0\0\0\24\0\2\0\210x\317T\217#\0\0\n@
> \200\375\n\0\0\0\24\0\1\0\376\200\0\0\0\0\0\0\3064k\377\376\256\37@
> \24\0\6\0\377\377\377\377\377\377\377\377f\5\0\0f\5\0\0@
> \0\0\0\24\0\2\0\210x\317T\217#\0\0\n@
> \200\375\v\0\0\0\24\0\1\0\376\200\0\0\0\0\0\0\3064k\377\376\256\37A\24\0\6\0\377\377\377\377\377\377\377\377\232\5\0\0\232\5\0\0@\0\0\0\24\0\2\0"...,
> 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
> 08:15:52.299242 close(2)                = 0
>
> Then trying to open nsswitch.conf :
>
> 08:15:52.299353 open("/etc/nsswitch.conf", O_RDONLY) = -1 ENOENT (No such
> file or directory)
>
> Then does the netlink + nsswitch dance a second time, followed by about
> 10 times the following with various domain name suffixes :
>
> 08:15:52.300841 open("/etc/resolv.conf", O_RDONLY) = -1 ENOENT (No such
> file or directory)
> 08:15:52.300938 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 2
> 08:15:52.301018 connect(2, {sa_family=AF_INET, sin_port=htons(53),
> sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> 08:15:52.301100 poll([{fd=2, events=POLLOUT}], 1, 0) = 1 ([{fd=2,
> revents=POLLOUT}])
> 08:15:52.301179 sendto(2, "\327\r\1\0\0\1\0\0\0\0\0\0\t_kerberos\f<various
> domain suffixes>\0\0\20\0\1", 51, MSG_NOSIGNAL, NULL, 0) = 51
> 08:15:52.301296 poll([{fd=2, events=POLLIN}], 1, 5000) = 1 ([{fd=2,
> revents=POLLERR}])
> 08:15:52.301373 close(2)                = 0
>
> Etc. It does that *a lot*. A few times we're seeing brk() with an
> increasing value though it's not huge enough to prove everything leaks
> there, but it proves that it happens inside openssl, since it's between
> a read() performed by openssl and a stat() performed by it as well :
>
> 08:16:02.055371 epoll_wait(0, {{EPOLLIN, {u32=6, u64=6}}, {EPOLLIN,
> {u32=5, u64=5}}}, 200, 159) = 2
> 08:16:02.055457 accept(6, {sa_family=AF_INET, sin_port=htons(13053),
> sin_addr=inet_addr(<some-public-address>)}, [16]) = 2
> 08:16:02.055550 fcntl(2, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
> 08:16:02.055658 accept(6, 0x7fff544b3e70, [128]) = -1 EAGAIN (Resource
> temporarily unavailable)
> 08:16:02.055806 read(2, "\200.\1\3\0\0\25\0\0\0\20", 11) = 11
> 08:16:02.055908 read(2,
> "\0\0\3\0\0\10\0\0\6\4\0\200\0\0\4\0\0\5\233G\314"..., 37) = 37
> 08:16:02.056005 brk(0x5062d000)         = 0x5062d000
> 08:16:02.056134 stat("/etc/krb5.conf", 0x7fff544b1990) = -1 ENOENT (No
> such file or directory)
> 08:16:02.056228 open("/dev/urandom", O_RDONLY) = -1 ENOENT (No such file
> or directory)
>
> OpenSSL sometimes acts stupidly like this inside a chroot. We've
> encountered a few issues in the past with openssl doing totally crazy
> stuff inside a chroot, including abort() on krb5-related things. From
> what I understood (others, please correct me if I'm wrong), such
> processing may be altered by the type of key or ciphers.
>
> In my opinion, you should attempt two things :
>
> 1) ensure that your ssl library is up to date (double checking doesn't
>    cost much)
>
> 2) try it again without the chroot statement to see if when openssl finds
>    what it's looking for, the leak stops.
>
> 3) maybe file a report to the openssl list about a memory leak in that
>    exact situation, with the traces you sent to me. Maybe they'll want
>    to have your public key as well to verify some assumptions about
>    what could be done inside the lib with its properties.
>
> Would you be able to simply stop the firewall's incoming checks on port
> 443 to confirm it's enough to stop the leak ? Another option might
> consist in starting two distinct haproxy processes, one for 80 and
> another one for 443.
>
> At this point, I guess I'm running out of ideas :-/
>
> Best regards,
> Willy
>
>

Reply via email to