Hi,

I'm using haproxy 1.9.6 from the deb repo:

#v+
~# haproxy -vv
HA-Proxy version 1.9.6-1ppa1~bionic 2019/03/30 - https://haproxy.org/
Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -O2 -fdebug-prefix-map=/build/haproxy-YXfmbO/haproxy-1.9.6=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-format-truncation -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered -Wno-missing-field-initializers -Wno-implicit-fallthrough -Wno-stringop-overflow -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_LUA=1 USE_SYSTEMD=1 USE_PCRE2=1 USE_PCRE2_JIT=1 USE_NS=1
Default settings : maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.1.0g  2 Nov 2017
Running on OpenSSL version : OpenSSL 1.1.0g  2 Nov 2017
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with Lua version : Lua 5.3.3
Built with network namespace support.
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with PCRE2 version : 10.31 2018-02-12
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with multi-threading support.

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
              h2 : mode=HTX        side=FE|BE
              h2 : mode=HTTP       side=FE
       <default> : mode=HTX        side=FE|BE
       <default> : mode=TCP|HTTP   side=FE|BE

Available filters :
        [SPOE] spoe
        [COMP] compression
        [CACHE] cache
        [TRACE] trace
#v-

Every few days I see some servers with a few hundred connections stuck in the CLOSE_WAIT state for hours. To put together a bug report I tried what was suggested here earlier, "show fd", but whenever I run it (echo 'show fd' | socat stdio /run/haproxy/haproxy.sock), all CPU cores go to 100% utilization and haproxy becomes unresponsive (it has to be restarted).
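For completeness, this is roughly how I count the stuck sockets. It is just a sketch: the live version pipes `ss -tan` (from iproute2) into the awk one-liner; the sample data below (with made-up peer addresses from the 192.0.2.0/24 documentation range) stands in for real output so the pipeline is self-contained:

```shell
# Group TCP sockets by state (first column of `ss -tan` output).
# Live usage would be:
#   ss -tan | awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }'
# The captured sample below replaces the live `ss` call for illustration.
sample='State      Recv-Q Send-Q Local-Address:Port Peer-Address:Port
CLOSE-WAIT 1      0      10.12.70.29:443    192.0.2.10:55123
CLOSE-WAIT 1      0      10.12.70.29:443    192.0.2.11:41988
ESTAB      0      0      127.0.0.1:6081     127.0.0.1:53210'

printf '%s\n' "$sample" \
  | awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' \
  | sort
```

When the problem is present, the CLOSE-WAIT count stays in the hundreds for hours instead of draining.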
I was able to run gdb before and after "show fd".

Before:
#v+
(gdb) bt
#0  0x00007fbb8fc36bb7 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000556f2801eb6c in _do_poll (p=<optimized out>, exp=144557181) at src/ev_epoll.c:156
#2  0x0000556f280c2152 in run_poll_loop () at src/haproxy.c:2675
#3  run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2707
#4  0x0000556f2801c616 in main (argc=<optimized out>, argv=0x7ffd733e9da8) at src/haproxy.c:3343
#v-

After "show fd":
#v+
(gdb) bt
#0  0x00007fbb8fc18e57 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000556f2815aef5 in thread_harmless_till_end () at src/hathreads.c:46
#2  0x0000556f2801f24e in thread_harmless_end () at include/common/hathreads.h:367
#3  _do_poll (p=<optimized out>, exp=144595172) at src/ev_epoll.c:171
#4  0x0000556f280c2152 in run_poll_loop () at src/haproxy.c:2675
#5  run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2707
#6  0x0000556f2801c616 in main (argc=<optimized out>, argv=0x7ffd733e9da8) at src/haproxy.c:3343
#v-

strace looks like this:
#v+
[pid 15426] sched_yield() = 0
[pid 15426] sched_yield() = 0
[pid 15426] sched_yield() = 0
[pid 15426] sched_yield() = 0
[pid 15426] sched_yield() = 0
[pid 15426] sched_yield() = 0
[pid 15426] sched_yield() = 0
[pid 15426] sched_yield() = 0
#v-

Another example:
#v+
[pid 6102] recvfrom(1944, "show fd\n", 15360, 0, NULL, NULL) = 8
[pid 6102] recvfrom(1944, "", 15352, 0, NULL, NULL) = 0
[pid 6102] clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=1671, tv_nsec=945933699}) = 0
[pid 6102] epoll_wait(101, [{EPOLLIN, {u32=11, u64=11}}], 200, 0) = 1
[pid 6102] clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=1671, tv_nsec=945944207}) = 0
[pid 6102] sched_yield() = 0
[pid 6102] sched_yield() = 0
[pid 6102] sched_yield() = 0
[pid 6102] sched_yield() = 0
[pid 6102] sched_yield() = 0
[pid 6102] sched_yield() = 0
[pid 6102] sched_yield() = 0
[pid 6102] sched_yield() = 0
[…]
#v-

Config:
#v+
global
    user haproxy
    group haproxy
    maxconn 20480
    daemon
    ca-base /etc/ssl/certs
    crt-base /etc/ssl/certs
    stats socket /run/haproxy/haproxy.sock mode 660 level admin
    stats timeout 2m # Wait up to 2 minutes for input
    nbthread 32
    tune.ssl.default-dh-param 4096
    ssl-default-bind-ciphers EECDH+AESGCM:AES256+EECDH:AES128+EECDH:EECDH:RSA+AES256+SHA:RSA+AES128+SHA
    ssl-default-bind-options ssl-min-ver TLSv1.2

defaults
    log 127.0.0.1 local1 notice
    mode http
    option httplog
    option log-health-checks
    option dontlognull
    option dontlog-normal
    retries 3
    option redispatch
    maxconn 20480
    timeout connect 60s
    timeout server 60s
    timeout client 60s

listen stats
    bind 10.12.70.29:81
    mode http
    stats enable
    stats refresh 5s
    stats uri /haproxy?stats
    stats realm Haproxy\ Statistics
    stats hide-version
    stats auth login:password

frontend http-in
    bind :80 v4v6
    bind :::80 v6only
    maxconn 30720
    http-request set-header X-Forwarded-For %[src]
    acl is_healthcheck path /__healthcheck
    use_backend healthcheck if is_healthcheck
    default_backend varnish-groups

frontend https-in
    bind :443 v4v6 ssl crt wildcard_example.com.pem alpn h2,http/1.1
    bind :::443 v6only ssl crt wildcard_example.com.pem alpn h2,http/1.1
    maxconn 30720
    http-request set-header X-Forwarded-For %[src]
    acl is_healthcheck path /__healthcheck
    use_backend healthcheck if is_healthcheck
    default_backend varnish-groups

backend healthcheck
    errorfile 503 /etc/haproxy/errors/200.http

backend varnish-groups
    balance uri
    hash-type consistent
    option httpchk HEAD /ping HTTP/1.1\r\nHost:\ example.com
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server varnish 127.0.0.1:6081 check
#v-

The issue only appears when there are connections stuck in CLOSE_WAIT for hours; normally "show fd" works fine.

Best
--
Łukasz Jagiełło
lukasz<at>jagiello<dot>org