We've been running into a fair number of haproxy processes lately that
won't shut down. We're currently using 1.7.5, but have also seen the
issue with earlier versions: 1.7.2 for sure, and likely further back.
The processes are signaled to shut down by haproxy-systemd-wrapper
after the wrapper receives a SIGHUP.
The last things logged by the process were the "Stopping frontend",
"Stopping backend" and "Proxy XXX stopped" messages.
When I do an `lsof -p XXX` I get:
# lsof -p 28856
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
haproxy 28856 root cwd DIR 253,0 4096 128 /
haproxy 28856 root rtd DIR 253,0 4096 128 /
haproxy 28856 root txt REG 253,0 1562240 25168059 /usr/sbin/haproxy
haproxy 28856 root DEL REG 0,4 420037375 /dev/zero
haproxy 28856 root mem REG 253,0 62184 26659777 /usr/lib64/libnss_files-2.17.so
haproxy 28856 root mem REG 253,0 155744 25213445 /usr/lib64/libselinux.so.1
haproxy 28856 root mem REG 253,0 111080 26659787 /usr/lib64/libresolv-2.17.so
haproxy 28856 root mem REG 253,0 15688 25315637 /usr/lib64/libkeyutils.so.1.5
haproxy 28856 root mem REG 253,0 62744 25394528 /usr/lib64/libkrb5support.so.0.1
haproxy 28856 root mem REG 253,0 143944 26659785 /usr/lib64/libpthread-2.17.so
haproxy 28856 root mem REG 253,0 202568 25300495 /usr/lib64/libk5crypto.so.3.1
haproxy 28856 root mem REG 253,0 15848 25213462 /usr/lib64/libcom_err.so.2.1
haproxy 28856 root mem REG 253,0 959008 25394526 /usr/lib64/libkrb5.so.3.3
haproxy 28856 root mem REG 253,0 324888 25300491 /usr/lib64/libgssapi_krb5.so.2.2
haproxy 28856 root mem REG 253,0 11384 25167850 /usr/lib64/libfreebl3.so
haproxy 28856 root mem REG 253,0 2118128 25167885 /usr/lib64/libc-2.17.so
haproxy 28856 root mem REG 253,0 398264 25195400 /usr/lib64/libpcre.so.1.2.0
haproxy 28856 root mem REG 253,0 11112 25195408 /usr/lib64/libpcreposix.so.0.0.1
haproxy 28856 root mem REG 253,0 1141928 26148751 /usr/lib64/libm-2.17.so
haproxy 28856 root mem REG 253,0 2025472 25300659 /usr/lib64/libcrypto.so.1.0.1e
haproxy 28856 root mem REG 253,0 454024 25300661 /usr/lib64/libssl.so.1.0.1e
haproxy 28856 root mem REG 253,0 19776 26148750 /usr/lib64/libdl-2.17.so
haproxy 28856 root mem REG 253,0 90664 25213451 /usr/lib64/libz.so.1.2.7
haproxy 28856 root mem REG 253,0 41080 25167891 /usr/lib64/libcrypt-2.17.so
haproxy 28856 root mem REG 253,0 155464 26148745 /usr/lib64/ld-2.17.so
haproxy 28856 root 0u a_inode 0,9 0 5823 [eventpoll]
haproxy 28856 root 1u IPv4 420797940 0t0 TCP 10.0.33.145:35754->10.0.33.147:1029 (CLOSE_WAIT)
haproxy 28856 root 2u IPv4 420266351 0t0 TCP 10.0.33.145:52898->10.0.33.147:1029 (CLOSE_WAIT)
haproxy 28856 root 3r REG 0,3 0 4026531956 net
haproxy 28856 root 4u IPv4 422150834 0t0 TCP 10.0.33.145:38874->10.0.33.147:1029 (CLOSE_WAIT)
haproxy 28856 root 5r REG 0,3 0 4026532437 net
haproxy 28856 root 6r REG 0,3 0 4026531956 net
haproxy 28856 root 13u unix 0xffff88009af6e800 0t0 420037384 socket
All those sockets have been sitting there like that for a long time.
The :1029 sockets are "peer" sync connections.
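For anyone who wants to check other stuck processes for the same pattern, here is a rough sketch (not part of haproxy or lsof) that pulls the TCP sockets in a given state out of saved `lsof -p` output. The function name `tcp_sockets_in_state` and the regular expression are my own assumptions about the column layout shown above; the pattern tolerates rows that a mail client has wrapped onto two lines.

```python
import re

# Matches the FD column (number plus access mode), then TYPE, DEVICE,
# SIZE/OFF, NODE ("TCP"), NAME and the parenthesised state.
# \s+ also matches newlines, so rows wrapped by a mail client still match.
# Assumption: the column layout is the one shown in the lsof output above.
ROW = re.compile(
    r"\s(\d+)[urw]\s+IPv[46]\s+\d+\s+\S+\s+TCP\s+(\S+)\s+\((\w+)\)"
)


def tcp_sockets_in_state(lsof_text, state="CLOSE_WAIT"):
    """Return (fd, endpoints) pairs for TCP sockets lsof reports in `state`."""
    return [
        (int(fd), name)
        for fd, name, found in ROW.findall(lsof_text)
        if found == state
    ]


if __name__ == "__main__":
    import sys

    # Usage sketch: lsof -p <pid> | python this_script.py
    for fd, name in tcp_sockets_in_state(sys.stdin.read()):
        print(fd, name)
```

Run against the output above, this should pick out file descriptors 1, 2 and 4, the three peer connections to :1029.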
File descriptor 13 is likely one of:
* The syslog connection to /dev/log
* A dead connection from an SSL worker process. We use nbproc>1 with
dedicated processes handling SSL termination, and then unix domain
sockets to forward to the main haproxy process. PID 28856 is the main
process, not an SSL terminator. The SSL terminator processes are already
shut down, so there's nothing on the other end of that socket.
I'm not sure what the other "net" entries (file descriptors 3, 5 and 6) are.
When I `strace -p XXX` I get:
# strace -p 28856
Process 28856 attached
epoll_wait(0, {}, 200, 319) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 362) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 114) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 203) = 0
epoll_wait(0, {}, 200, 0) = 0
epoll_wait(0, {}, 200, 331) = 0
epoll_wait(0, {}, 200, 0) = 0
When I do `bt full` in gdb I get:
(gdb) bt full
#0 0x00007f5f3efdacf3 in __epoll_wait_nocancel () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007f5f409a4c7c in _do_poll (p=<optimized out>, exp=910827830) at src/ev_epoll.c:125
status = <optimized out>
eo = <optimized out>
fd = <optimized out>
opcode = <optimized out>
count = <optimized out>
updt_idx = <optimized out>
wait_time = 831
#2 0x00007f5f409052d8 in run_poll_loop () at src/haproxy.c:1741
next = <optimized out>
#3 0x00007f5f409014fd in main (argc=<optimized out>, argv=<optimized out>) at src/haproxy.c:2104
err = <optimized out>
retry = <optimized out>
limit = {rlim_cur = 131149, rlim_max = 131149}
errmsg = "\000\000\000\000\000\000\000\000\274/\366>_\177\000\000[\001\000\000\000\000\000\000\030\000\000\000\000\000\000\000nÚ·?_\177\000\000\223\065\247?_\177\000\000\020\006\006@_\177\000\000`\212\305@_\177\000\000\020\006\006@_\177\000\000\260\340\305@_\177\000\000\300L\274\230\374\177\000\000{\353\256?_\177\000\000>\001\000\024"
pidfd = <optimized out>
When I look at /proc/XXX/fdinfo/0 (the epoll file descriptor) I get:
# cat /proc/28856/fdinfo/0
pos: 0
flags: 02
mnt_id: 10
Note that no file descriptors are listed (there are no "tfd:" lines),
so the epoll set is empty.
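To make that check repeatable, here is a minimal sketch that lists the descriptors registered on an epoll instance by reading its fdinfo file. It assumes the kernel reports each registered descriptor as a line beginning with "tfd:", as described in proc(5); the function names `epoll_registered_fds` and `read_epoll_fds` are hypothetical helpers of mine.

```python
def epoll_registered_fds(fdinfo_text):
    """Return the target fds registered on an epoll instance.

    Assumes each registered descriptor appears in fdinfo as a line of
    the form "tfd: <fd> events: <mask> data: <value>", per proc(5).
    """
    fds = []
    for line in fdinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "tfd:":
            fds.append(int(fields[1]))
    return fds


def read_epoll_fds(pid, epoll_fd):
    """Read /proc/<pid>/fdinfo/<epoll_fd> and return its registered fds."""
    with open(f"/proc/{pid}/fdinfo/{epoll_fd}") as f:
        return epoll_registered_fds(f.read())
```

For the process above, `read_epoll_fds(28856, 0)` would be expected to return an empty list, which matches the strace output where every epoll_wait returns 0 events.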
I can provide the config if desired, but it's very large, and I'll have
to strip info out of it.
-Patrick