Hi, I'm sorry that my gdb skills and my knowledge of HAProxy and C are not
sufficient for this debugging process.
With the patch applied I tried again to use the spoa from
"contrib/spoa_example/". The example spoa agent does not understand my
spoe-message and silently ignores it, but that doesn't matter.
To trigger the segmentation fault I have to reload HAProxy (when using the
spoa from an external vendor this additional reload wasn't necessary; I
just had to wait a couple of seconds for the crash to trigger).
Usually HAProxy crashes in process_stream, but once it crashed at the
"(long)h2s->subs & 1" check in testcorrupt() during spoe_release_appctx:
#0 0x00005597450c25f9 in testcorrupt (ptr=0x7f4fb8071990) at src/mux_h2.c:6238
cs = 0x7f4fb8071990
h2s = 0x7f4fe85751f0
#1 0x0000559745196239 in spoe_release_appctx (appctx=0x7f4fe8324e00) at src/flt_spoe.c:1294
si = 0x7f4fe82b31f8
spoe_appctx = 0x7f4fe88dd760
agent = 0x559746052580
ctx = 0x7f4fe8380b80
back = 0x559746355b38
Then I tried again to replicate the bug on my r&d server, this time
reloading HAProxy multiple times during the test, and it crashed.
HAProxy was compiled with git HEAD set to
77015abe0bcfde67bff519b1d48393a513015f77, with the patch
0001-EXP-try-to-spot-where-h2s-subs-changes-V2.patch applied
and with a modified struct h2s:
diff --git a/src/mux_h2.c b/src/mux_h2.c
index 9928b32c7..3d5187271 100644
--- a/src/mux_h2.c
+++ b/src/mux_h2.c
@@ -206,6 +206,8 @@ struct h2s {
 	uint16_t status; /* HTTP response status */
 	unsigned long long body_len; /* remaining body length according to content-length if H2_SF_DATA_CLEN */
 	struct buffer rxbuf; /* receive buffer, always valid (buf_empty or real buffer) */
+	struct tasklet *dummy0;
+	struct wait_event *dummy1;
 	struct wait_event *subs; /* recv wait_event the conn_stream associated is waiting on (via h2_subscribe) */
 	struct list list; /* To be used when adding in h2c->send_list or h2c->fctl_lsit */
 	struct tasklet *shut_tl; /* deferred shutdown tasklet, to retry to send an RST after we failed to,
Attached:
haproxy.cfg (/etc/haproxy/haproxy.cfg main config)
spoe-example.conf (/etc/haproxy/spoe-example.conf spoe config)
I used the spoa from contrib/spoa_example, run with the command:
"./spoa -p 4545 -c fragmentation -c async -c pipelining"
I used vegeta (https://github.com/tsenart/vegeta) to generate traffic,
with the command:
"cat input | ./vegeta attack -duration=360s -insecure -keepalive=false -http2=true -rate=500/1s > /dev/null"
I used 2 virtual machines to generate traffic, and additionally I launched
vegeta on the host running HAProxy.
The input file is:
GET https://haproxy-crash.test.local/
zdebek:
sdofijdsoifjodisjfoisdjfoisdovisoivjdfoijvoisdjvopsdijg0934u49032ut09gir09j40g9u0492it093i2g09i0r9bi2490ib094i0b9i09i0924it09bi2222itk42jh09tj4309sdfjdlsjfoadiwe9023i0r92094i4309gi0934ig9034ig093i4g90i3409gi3409gi0394ig0934i0g93jjoujgiurhjgiuerhgiurehgiuerhg89489u098u509u09wrut0923ej23fjjsufdsuf98dusf98u98u2398uf9834uf983u49f8h98huish9fsdu98fusd98uf982u398u3298ru2938uffhsdijhfisdjhiusdhfiu2iuhf2398289823189831893198931udashidsah
I reloaded the HAProxy configuration (multiple times, again and again
until the segmentation fault occurred) with:
haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf 10608
On Mon, 9 Nov 2020 at 16:01, Maciej Zdeb <[email protected]> wrote:
> It crashed now on first test in process_stream:
>
> struct task *process_stream(struct task *t, void *context, unsigned short
> state)
> {
> struct server *srv;
> struct stream *s = context;
> struct session *sess = s->sess;
> unsigned int rqf_last, rpf_last;
> unsigned int rq_prod_last, rq_cons_last;
> unsigned int rp_cons_last, rp_prod_last;
> unsigned int req_ana_back;
> struct channel *req, *res;
> struct stream_interface *si_f, *si_b;
> unsigned int rate;
>
> TEST_STRM(s);
> [...]
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0 0x000055f4cda7b5f9 in testcorrupt (ptr=0x7f75ac1ed990) at src/mux_h2.c:6238
> [Current thread is 1 (Thread 0x7f75a98b9700 (LWP 5860))]
> (gdb) bt full
> #0 0x000055f4cda7b5f9 in testcorrupt (ptr=0x7f75ac1ed990) at src/mux_h2.c:6238
> cs = 0x7f75ac1ed990
> h2s = 0x7f7584244510
> #1 0x000055f4cdad8993 in process_stream (t=0x7f75ac139d70, context=0x7f7588066540, state=260) at src/stream.c:1499
> srv = 0x7f75a9896390
> s = 0x7f7588066540
> sess = 0x7f759c071b80
> rqf_last = 4294967294
> rpf_last = 2217468112
> rq_prod_last = 32629
> rq_cons_last = 2217603024
> rp_cons_last = 32629
> rp_prod_last = 2217182865
> req_ana_back = 2217603025
> req = 0x7f75a9896350
> res = 0x55f4cdbed618 <__task_queue+92>
> si_f = 0x55f4ce03c680 <task_per_thread+896>
> si_b = 0x7f75842def80
> rate = 2217603024
> #2 0x000055f4cdbeddb2 in run_tasks_from_list (list=0x55f4ce03c6c0 <task_per_thread+960>, max=150) at src/task.c:371
> process = 0x55f4cdad892d <process_stream>
> t = 0x7f75ac139d70
> state = 260
> ctx = 0x7f7588066540
> done = 3
> [...]
>
> subs is 0xffffffff like before, BUT dummy1 has also changed, to 0xffff:
>
> (gdb) p *(struct h2s*)(0x7f7584244510)
> $1 = {cs = 0x7f75ac1ed990, sess = 0x55f4ce02be40 <pool_cache+7328>, h2c =
> 0x7f758417abd0, h1m = {state = H1_MSG_RPBEFORE, flags = 12, curr_len = 0,
> body_len = 0, next = 0, err_pos = -1, err_state = 0}, by_id = {node = {
> branches = {b = {0x7f758428e430, 0x7f7584244550}}, node_p =
> 0x7f758428e431, leaf_p = 0x7f7584244551, bit = 1, pfx = 33828}, key = 23},
> id = 23, flags = 16385, sws = 0, errcode = H2_ERR_NO_ERROR, st = H2_SS_HREM,
> status = 0, body_len = 0, rxbuf = {size = 16384, area = 0x7f75780a2210
> "Ð?", data = 16384, head = 0}, dummy0 = 0x0, dummy1 = 0xffff, subs =
> 0xffffffff, list = {n = 0x7f75842445c8, p = 0x7f75842445c8},
> shut_tl = 0x7f75842df0d0}
>
> On Mon, 9 Nov 2020 at 15:07, Christopher Faulet <[email protected]> wrote:
>
>> On 09/11/2020 at 13:10, Maciej Zdeb wrote:
>> > I've played a little bit with the patch, and it led me to the
>> > backend.c file and the connect_server() function:
>> >
>> > int connect_server(struct stream *s)
>> > {
>> > [...]
>> >     if (!conn_xprt_ready(srv_conn) && !srv_conn->mux) {
>> >         /* set the correct protocol on the output stream interface */
>> >         if (srv)
>> >             conn_prepare(srv_conn, protocol_by_family(srv_conn->dst->ss_family), srv->xprt);
>> >         else if (obj_type(s->target) == OBJ_TYPE_PROXY) {
>> >             /* proxies exclusively run on raw_sock right now */
>> >             conn_prepare(srv_conn, protocol_by_family(srv_conn->dst->ss_family), xprt_get(XPRT_RAW));
>> >             if (!(srv_conn->ctrl)) {
>> >                 conn_free(srv_conn);
>> >                 return SF_ERR_INTERNAL;
>> >             }
>> >         }
>> >         else {
>> >             conn_free(srv_conn);
>> >             return SF_ERR_INTERNAL; /* how did we get there ? */
>> >         }
>> >         // THIS ONE IS OK
>> >         TEST_STRM(s);
>> >         //////////////////////////////
>> >         srv_cs = si_alloc_cs(&s->si[1], srv_conn);
>> >         // FAIL
>> >         TEST_STRM(s);
>> >         //////////////////////////////
>> >         if (!srv_cs) {
>> >             conn_free(srv_conn);
>> >             return SF_ERR_RESOURCE;
>> >         }
>>
>> Hi,
>>
>> In fact, this crash occurs because of Willy's patch. It was not designed
>> to handle non-h2 connections. Here the crash happens on a TCP connection,
>> used by a SPOE applet for instance.
>>
>> I updated his patch. First, I added some calls to TEST_STRM() in the SPOE
>> code, to be sure. I also explicitly set the stream task to NULL in
>> stream_free() to catch late wakeups in the SPOE. Finally, I modified
>> testcorrupt(). I hope this one is correct. But if I missed something, you
>> may keep only the last ABORT_NOW() in testcorrupt() and replace the
>> others with a return statement, just like in Willy's patch.
>>
>> --
>> Christopher Faulet
>>
>