Georges-Etienne,

your captures were extremely informative. While I cannot reproduce the
behaviour here even by reinjecting the same health check requests, I'm
seeing two really odd things in your trace below :

We accept an SSL connection from the firewall :

08:15:52.297357 accept(6, {sa_family=AF_INET, sin_port=htons(32764), 
sin_addr=inet_addr("<firewall>")}, [16]) = 1

It sends 48 bytes :

08:15:52.297717 read(1, "\200.\1\3\0\0\25\0\0\0\20", 11) = 11
08:15:52.297831 read(1, 
"\0\0\3\0\0\10\0\0\6\4\0\200\0\0\4\0\0\5O\0\0@\202J#i\242K7)\300\2536o\245=\23",
 37) = 37

Then we're checking for /etc/krb5.conf :

08:15:52.297984 stat("/etc/krb5.conf", 0x7fff544b1990) = -1 ENOENT (No such 
file or directory)

Then trying to read some random :

08:15:52.298082 open("/dev/urandom", O_RDONLY) = -1 ENOENT (No such file or 
directory)

Then trying to figure the local host name :

08:15:52.298187 uname({sys="Linux", node="<node's local hostname>", ...}) = 0

Then doing some netlink-based studd :

08:15:52.298316 socket(PF_NETLINK, SOCK_RAW, 0) = 2
08:15:52.298395 bind(2, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
08:15:52.298471 getsockname(2, {sa_family=AF_NETLINK, pid=9103, 
groups=00000000}, [12]) = 0
08:15:52.298550 sendto(2, "\24\0\0\0\26\0\1\3\210x\317T\0\0\0\0\0\0\0\0", 20, 
0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
08:15:52.298650 recvmsg(2, {msg_name(12)={sa_family=AF_NETLINK, pid=0, 
groups=00000000}, 
msg_iov(1)=[{"0\0\0\0\24\0\2\0\210x\317T\217#\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1\10\0\2\0\177\0\0\1\7\0\3\0lo\0\0<\0\0\0\24\0\2\0\210x\317T\217#\0\0\2\32\200\0\n\0\0\0\10\0\1\0\n\0\35\22\10\0\2\0\n\0\35\22\10\0\4\0\n\0\35?\n\0\3\0bond0\0\0\0<\0\0\0\24\0\2\0\210x\317T\217#\0\0\2\32\200\0\f\0\0\0\10\0\1\0\n\2\177\217\10\0\2\0\n\2\177\217\10\0\4\0\n\2\177\277\n\0\3\0bond2\0\0\0<\0\0\0\24\0\2\0\210x\317T\217#\0\0\2\32\200\0\r\0\0\0\10\0\1\0\nZ\6j"...,
 4096}], msg_controllen=0, msg_flags=0}, 0) = 356
08:15:52.298841 recvmsg(2, {msg_name(12)={sa_family=AF_NETLINK, pid=0, 
groups=00000000}, 
msg_iov(1)=[{"@\0\0\0\24\0\2\0\210x\317T\217#\0\0\n\200\200\376\1\0\0\0\24\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\24\0\6\0\377\377\377\377\377\377\377\377H\3\0\0H\3\0\0@\0\0\0\24\0\2\0\210x\317T\217#\0\0\n@\200\375\n\0\0\0\24\0\1\0\376\200\0\0\0\0\0\0\3064k\377\376\256\37@\24\0\6\0\377\377\377\377\377\377\377\377f\5\0\0f\5\0\0@\0\0\0\24\0\2\0\210x\317T\217#\0\0\n@\200\375\v\0\0\0\24\0\1\0\376\200\0\0\0\0\0\0\3064k\377\376\256\37A\24\0\6\0\377\377\377\377\377\377\377\377\232\5\0\0\232\5\0\0@\0\0\0\24\0\2\0"...,
 4096}], msg_controllen=0, msg_flags=0}, 0) = 448
08:15:52.299059 recvmsg(2, {msg_name(12)={sa_family=AF_NETLINK, pid=0, 
groups=00000000}, 
msg_iov(1)=[{"\24\0\0\0\3\0\2\0\210x\317T\217#\0\0\0\0\0\0\1\0\0\0\24\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\24\0\6\0\377\377\377\377\377\377\377\377H\3\0\0H\3\0\0@\0\0\0\24\0\2\0\210x\317T\217#\0\0\n@\200\375\n\0\0\0\24\0\1\0\376\200\0\0\0\0\0\0\3064k\377\376\256\37@\24\0\6\0\377\377\377\377\377\377\377\377f\5\0\0f\5\0\0@\0\0\0\24\0\2\0\210x\317T\217#\0\0\n@\200\375\v\0\0\0\24\0\1\0\376\200\0\0\0\0\0\0\3064k\377\376\256\37A\24\0\6\0\377\377\377\377\377\377\377\377\232\5\0\0\232\5\0\0@\0\0\0\24\0\2\0"...,
 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
08:15:52.299242 close(2)                = 0

Then trying to open nsswitch.conf :

08:15:52.299353 open("/etc/nsswitch.conf", O_RDONLY) = -1 ENOENT (No such file 
or directory)

Then does the netlink + nsswitch dance a second time, followed by about
10 times the following with various domain name suffixes :

08:15:52.300841 open("/etc/resolv.conf", O_RDONLY) = -1 ENOENT (No such file or 
directory)
08:15:52.300938 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 2
08:15:52.301018 connect(2, {sa_family=AF_INET, sin_port=htons(53), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 0
08:15:52.301100 poll([{fd=2, events=POLLOUT}], 1, 0) = 1 ([{fd=2, 
revents=POLLOUT}])
08:15:52.301179 sendto(2, "\327\r\1\0\0\1\0\0\0\0\0\0\t_kerberos\f<various 
domain suffixes>\0\0\20\0\1", 51, MSG_NOSIGNAL, NULL, 0) = 51
08:15:52.301296 poll([{fd=2, events=POLLIN}], 1, 5000) = 1 ([{fd=2, 
revents=POLLERR}])
08:15:52.301373 close(2)                = 0

Etc. It does that *a lot*. A few times we're seeing brk() with an
increasing value though it's not huge enough to prove everything leaks
there, but it proves that it happens inside openssl, since it's between
a read() performed by openssl and a stat() performed by it as well :

08:16:02.055371 epoll_wait(0, {{EPOLLIN, {u32=6, u64=6}}, {EPOLLIN, {u32=5, 
u64=5}}}, 200, 159) = 2
08:16:02.055457 accept(6, {sa_family=AF_INET, sin_port=htons(13053), 
sin_addr=inet_addr(<some-public-address>)}, [16]) = 2
08:16:02.055550 fcntl(2, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
08:16:02.055658 accept(6, 0x7fff544b3e70, [128]) = -1 EAGAIN (Resource 
temporarily unavailable)
08:16:02.055806 read(2, "\200.\1\3\0\0\25\0\0\0\20", 11) = 11
08:16:02.055908 read(2, "\0\0\3\0\0\10\0\0\6\4\0\200\0\0\4\0\0\5\233G\314"..., 
37) = 37
08:16:02.056005 brk(0x5062d000)         = 0x5062d000
08:16:02.056134 stat("/etc/krb5.conf", 0x7fff544b1990) = -1 ENOENT (No such 
file or directory)
08:16:02.056228 open("/dev/urandom", O_RDONLY) = -1 ENOENT (No such file or 
directory)

OpenSSL sometimes acts stupidly like this inside a chroot. We've
encountered a few issues in the past with openssl doing totally crazy
stuff inside a chroot, including abort() on krb5-related things. From
what I understood (others, please correct me if I'm wrong), such
processing may be altered by the type of key or ciphers.

In my opinion, you should attempt two things :

1) ensure that your ssl library is up to date (double checking doesn't
   cost much)

2) try it again without the chroot statement to see if when openssl finds
   what it's looking for, the leak stops.

3) maybe file a report to the openssl list about a memory leak in that
   exact situation, with the traces you sent to me. Maybe they'll want
   to have your public key as well to verify some assumptions about
   what could be done inside the lib with its properties.

Would you be able to simply stop the firewall's incoming checks on port
443 to confirm it's enough to stop the leak ? Another option might
consist in starting two distinct haproxy processes, one for 80 and
another one for 443.

At this point, I guess I'm running out of ideas :-/

Best regards,
Willy


Reply via email to