Hi,
On Wed, Apr 10, 2019 at 01:20:34PM -0700, LCF wrote:
> Every few days I see some servers with few hundreds connections in
> CLOSE_WAIT state for hours. I tried suggested earlier here - "show fd" to
> construct a bug report but whenever I run "show fd" (echo 'show fd' | socat
> stdio /run/haproxy/haproxy.sock) all CPU cores are with 100% utilization
> and haproxy is unresponsive (needs to be restarted).

This is an extremely useful report! It indicates that there is a locking
issue on a thread. The CPU core at 100% is likely looping on a busy lock,
and when you issue "show fd" it requires all other threads to stop doing
anything, which the first one doesn't do, hence the totally locked situation
you're facing. Note that a few other events could trigger the same situation,
such as a soft restart or servers changing state.

The CLOSE_WAIT state is expected in this case, as it corresponds to sockets
closed by the peer that were handled by the blocked thread and are thus
never closed on our side.

Next time it happens, it would be extremely useful if you could generate
a core using gdb ("generate-core-file") and share it along with your
executable (unstripped and built with -g, if possible). Ideally, please do
this without attempting "show fd" first, so that we can more easily see
which thread is blocked and try to figure out why.
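
In case it helps, the core can also be taken non-interactively. What follows
is only a sketch under a few assumptions: gdb is installed, you run it as root
or as the user owning the process, and the dump_core helper name, the GDB
override variable and the /tmp output path are purely illustrative. Keep in
mind that the process stays stopped while gdb writes the core.

```shell
#!/bin/sh
# Sketch: dump a core of a running process with gdb in batch mode.
# dump_core, GDB and the default output path are illustrative names only.

dump_core() {
    pid=$1
    # Default output file; pass a second argument to override it.
    out=${2:-/tmp/haproxy.$pid.core}
    # Attach to the PID, write the core, then exit (gdb detaches on exit).
    # GDB can be overridden, e.g. GDB="echo gdb", to print the command only.
    ${GDB:-gdb} -p "$pid" -batch -ex "generate-core-file $out"
}

# Example (not executed here): one core per running haproxy process.
#   for pid in $(pidof haproxy); do dump_core "$pid"; done
```

Running it first with GDB="echo gdb" only prints the command that would be
executed, which is a cheap way to check the PID and output path before
attaching to a production process.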
Thank you!
Willy