We are running HAProxy 1.9.6 and managed to get into a state where
HAProxy was completely unresponsive. It was pegged at 100% CPU like many
of the other reports here on the mailing list lately, but in addition it
wouldn't respond to anything; even the stats socket was unresponsive.
When I attached strace, it sat there with no activity at all, which
suggests a userspace busy loop rather than a blocked syscall. When I
attached GDB I got the following stack:
(gdb) bt full
#0 htx_get_head (htx=0x7fbeb666eba0) at include/common/htx.h:357
No locals.
#1 h2s_htx_make_trailers (h2s=h2s@entry=0x7fbeb625f9f0,
htx=htx@entry=0x7fbeb666eba0) at src/mux_h2.c:4975
list = {{n = {ptr = 0x0, len = 0}, v = {ptr =
0x0, len = 0}} <repeats 101 times>}
h2c = 0x7fbeb6372320
blk = <optimized out>
blk_end = 0x0
outbuf = {size = 140722044755807, area = 0x0,
data = 140457080712096, head = 140457060939041}
h1m = {state = H1_MSG_HDR_NAME, flags = 2056,
curr_len = 140457077580664, body_len = 16384, next = 2, err_pos = 0,
err_state = -1237668736}
type = <optimized out>
ret = 0
hdr = 0
idx = <optimized out>
start = <optimized out>
#2 0x00007fbeb50f2ef5 in h2_snd_buf (cs=0x7fbeb63ea9a0,
buf=0x7fbeb6127048, count=2, flags=<optimized out>) at src/mux_h2.c:5372
h2s = <optimized out>
orig_count = <optimized out>
total = 15302
ret = <optimized out>
htx = 0x7fbeb666eba0
blk = <optimized out>
btype = <optimized out>
idx = <optimized out>
#3 0x00007fbeb5180be4 in si_cs_send (cs=0x7fbeb63ea9a0) at
src/stream_interface.c:691
send_flag = <optimized out>
conn = 0x7fbeb6051a70
si = 0x7fbeb6127268
oc = 0x7fbeb6127040
ret = <optimized out>
did_send = 0
#4 0x00007fbeb51817c8 in si_update_both
(si_f=si_f@entry=0x7fbeb6127268, si_b=si_b@entry=0x7fbeb61272a8) at
src/stream_interface.c:850
req = 0x7fbeb6126fe0
res = <optimized out>
cs = <optimized out>
#5 0x00007fbeb50ea2e1 in process_stream (t=<optimized out>,
context=0x7fbeb6126fd0, state=<optimized out>) at src/stream.c:2502
srv = <optimized out>
s = 0x7fbeb6126fd0
sess = <optimized out>
rqf_last = <optimized out>
rpf_last = 3255042562
rq_prod_last = <optimized out>
rq_cons_last = <optimized out>
rp_cons_last = 7
rp_prod_last = 7
req_ana_back = <optimized out>
req = 0x7fbeb6126fe0
res = 0x7fbeb6127040
si_f = 0x7fbeb6127268
si_b = 0x7fbeb61272a8
#6 0x00007fbeb51b20a8 in process_runnable_tasks () at
src/task.c:434
t = <optimized out>
state = <optimized out>
ctx = <optimized out>
process = <optimized out>
t = <optimized out>
max_processed = <optimized out>
#7 0x00007fbeb512b6ff in run_poll_loop () at src/haproxy.c:2642
next = <optimized out>
exp = <optimized out>
#8 run_thread_poll_loop (data=data@entry=0x7fbeb5d84620) at
src/haproxy.c:2707
ptif = <optimized out>
ptdf = <optimized out>
start_lock = 0
#9 0x00007fbeb507d2b5 in main (argc=<optimized out>,
argv=0x7ffc677d73b8) at src/haproxy.c:3343
tids = 0x7fbeb5d84620
threads = 0x7fbeb5eb6d90
i = <optimized out>
old_sig = {__val = {68097, 0, 511101108338, 0,
140722044760335, 140457059422467, 140722044760392, 140454020513805, 124,
140457064304960, 390842023936, 140457064395072, 48, 140457035994976,
18446603351664791121, 140454020513794}}
blocked_sig = {__val = {18446744067199990583,
18446744073709551615 <repeats 15 times>}}
err = <optimized out>
retry = <optimized out>
limit = {rlim_cur = 131300, rlim_max = 131300}
errmsg =
"\000@\000\000\000\000\000\000\002\366\210\263\276\177\000\000\300\364m\265\276\177\000\000`\227\274\263\276\177\000\000\030\000\000\000\000\000\000\000>\001\000\024\000\000\000\000p$o\265\276\177\000\000@>k\265\276\177\000\000\000\320$\265\276\177\000\000\274\276\177\000\000
t}g\374\177\000\000\000\000\000\000\000\000\000\000P\367m\265"
pidfd = <optimized out>
Our config is big and complex, and not something I want to post here (I
may be able to provide it directly if required). However, I think the
important bit is that we have a frontend and backend which are used for
load balancing gRPC traffic (thus h2). The backend servers are h2c
(no SSL/TLS).
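For what it's worth, the relevant shape of that part of our config is
roughly the following (names, addresses, and the cert path are
placeholders, not our real values):

```haproxy
frontend grpc_fe
    mode http
    # TLS on the front with ALPN advertising h2 so gRPC clients negotiate HTTP/2
    bind :443 ssl crt /etc/haproxy/cert.pem alpn h2,http/1.1
    default_backend grpc_be

backend grpc_be
    mode http
    # "proto h2" forces cleartext HTTP/2 (h2c) to the backend servers -- no ssl keyword
    server grpc1 10.0.0.1:50051 proto h2 check
    server grpc2 10.0.0.2:50051 proto h2 check
```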
The service has since been restarted, so the live process can no longer
be probed. However, I did capture a core file before doing so.
-Patrick