Re: v1.9.6 high CPU usage

2019-04-11 Thread Willy Tarreau
Hi,

On Wed, Apr 10, 2019 at 01:20:34PM -0700, LCF wrote:
> Every few days I see some servers with few hundreds connections in
> CLOSE_WAIT state for hours. I tried suggested earlier here - "show fd" to
> construct a bug report but whenever I run "show fd" (echo 'show fd' | socat
> stdio /run/haproxy/haproxy.sock) all CPU cores are with 100% utilization
> and haproxy is unresponsive (needs to be restarted).

This is an extremely useful report! It indicates that there is a locking
issue on a thread. The CPU core at 100% is likely looping on a busy lock,
and when you issue "show fd" it requires all other threads to stop doing
anything, which the first one doesn't do, hence the totally locked situation
you're facing. Note that there are a few other situations where this could
possibly happen like a soft restart or servers changing state.

The CLOSE_WAIT state is expected in this case as it corresponds to closed
sockets that were handled by the blocked thread and that are thus not
being acknowledged as closed.

Next time it happens, it would be extremely useful if you could generate
a core using gdb ("generate-core-file") and share it along with your
executable (unstripped and built with -g please and if possible). Ideally
please issue this without attempting the show fd so that we can more
easily observe what thread is being blocked and try to figure why.

Thank you!
Willy



v1.9.6 high CPU usage

2019-04-10 Thread LCF
Hi,

I'm using haproxy 1.9.6 from deb repo:
#v+
~# haproxy -vv
HA-Proxy version 1.9.6-1ppa1~bionic 2019/03/30 - https://haproxy.org/
Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -O2 -g -O2
-fdebug-prefix-map=/build/haproxy-YXfmbO/haproxy-1.9.6=.
-fstack-protector-strong -Wformat -Werror=format-security -Wdate-time
-D_FORTIFY_SOURCE=2 -fno-strict-aliasing -Wdeclaration-after-statement
-fwrapv -Wno-format-truncation -Wno-unused-label -Wno-sign-compare
-Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers
-Wno-clobbered -Wno-missing-field-initializers -Wno-implicit-fallthrough
-Wno-stringop-overflow -Wtype-limits -Wshift-negative-value
-Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1
USE_LUA=1 USE_SYSTEMD=1 USE_PCRE2=1 USE_PCRE2_JIT=1 USE_NS=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.1.0g  2 Nov 2017
Running on OpenSSL version : OpenSSL 1.1.0g  2 Nov 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with Lua version : Lua 5.3.3
Built with network namespace support.
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with PCRE2 version : 10.31 2018-02-12
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with multi-threading support.

Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as  cannot be specified using 'proto' keyword)
  h2 : mode=HTXside=FE|BE
  h2 : mode=HTTP   side=FE
: mode=HTXside=FE|BE
: mode=TCP|HTTP   side=FE|BE

Available filters :
[SPOE] spoe
[COMP] compression
[CACHE] cache
[TRACE] trace
#v-

Every few days I see some servers with few hundreds connections in
CLOSE_WAIT state for hours. I tried suggested earlier here - "show fd" to
construct a bug report but whenever I run "show fd" (echo 'show fd' | socat
stdio /run/haproxy/haproxy.sock) all CPU cores are with 100% utilization
and haproxy is unresponsive (needs to be restarted).

I was able to run gdb before and after "show fd":
before:
#v+
(gdb) bt
#0  0x7fbb8fc36bb7 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x556f2801eb6c in _do_poll (p=, exp=144557181) at
src/ev_epoll.c:156
#2  0x556f280c2152 in run_poll_loop () at src/haproxy.c:2675
#3  run_thread_poll_loop (data=) at src/haproxy.c:2707
#4  0x556f2801c616 in main (argc=, argv=0x7ffd733e9da8)
at src/haproxy.c:3343
#v-
After "show fd"
#v+
(gdb) bt
#0  0x7fbb8fc18e57 in sched_yield () from
/lib/x86_64-linux-gnu/libc.so.6
#1  0x556f2815aef5 in thread_harmless_till_end () at src/hathreads.c:46
#2  0x556f2801f24e in thread_harmless_end () at
include/common/hathreads.h:367
#3  _do_poll (p=, exp=144595172) at src/ev_epoll.c:171
#4  0x556f280c2152 in run_poll_loop () at src/haproxy.c:2675
#5  run_thread_poll_loop (data=) at src/haproxy.c:2707
#6  0x556f2801c616 in main (argc=, argv=0x7ffd733e9da8)
at src/haproxy.c:3343
#v-

strace looks like that:
#v+
[pid 15426] sched_yield()   = 0
[pid 15426] sched_yield()   = 0
[pid 15426] sched_yield()   = 0
[pid 15426] sched_yield()   = 0
[pid 15426] sched_yield()   = 0
[pid 15426] sched_yield()   = 0
[pid 15426] sched_yield()   = 0
[pid 15426] sched_yield()   = 0
#v-

other example:
#v+
[pid  6102] recvfrom(1944, "show fd\n", 15360, 0, NULL, NULL) = 8
[pid  6102] recvfrom(1944, "", 15352, 0, NULL, NULL) = 0
[pid  6102] clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=1671,
tv_nsec=945933699}) = 0
[pid  6102] epoll_wait(101, [{EPOLLIN, {u32=11, u64=11}}], 200, 0) = 1
[pid  6102] clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=1671,
tv_nsec=945944207}) = 0
[pid  6102] sched_yield()   = 0
[pid  6102] sched_yield()   = 0
[pid  6102] sched_yield()   = 0
[pid  6102] sched_yield()   = 0
[pid  6102] sched_yield()   = 0
[pid  6102] sched_yield()   = 0
[pid  6102] sched_yield()   = 0
[pid  6102] sched_yield()   = 0
[…]
#v-

config:
#v+
global
  user haproxy
  group haproxy
  maxconn 20480
  daemon
  ca-base /etc/ssl/certs
  crt-base /etc/ssl/certs

  stats socket /run/haproxy/haproxy.sock mode 660 level admin
  stats timeout 2m # Wait up to 2 minutes for input

  nbthread 32

  tune.ssl.default-dh-param 4096