Hi Maciej,

On Fri, May 10, 2019 at 06:45:21PM +0200, Maciej Zdeb wrote:
> Olivier, it's still looping, but differently:
> 
> 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> (gdb) n
> 2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags &
> H2_CF_MUX_BLOCK_ANY)
> (gdb)
(...)
> (gdb) p *h2s
> $1 = {cs = 0x2f84190, sess = 0x819580 <pool_cache+96>, h2c = 0x2f841a0, h1m
> = {state = 48, flags = 0, curr_len = 38317, body_len = 103852, next = 413,
> err_pos = -1, err_state = 0}, by_id = {node = {branches = {b = {0x34c0260,
>           0x321d330}}, node_p = 0x0, leaf_p = 0x0, bit = 1, pfx = 47005},
> key = 3}, id = 3, flags = 28675, mws = 1017461, errcode = H2_ERR_NO_ERROR,
> st = H2_SS_CLOSED, status = 200, body_len = 0, rxbuf = {size = 0, area =
> 0x0, data = 0,
>     head = 0}, wait_event = {task = 0x2cd0ed0, handle = 0x3, events = 0},
> recv_wait = 0x0, send_wait = 0x321d390, list = {n = 0x321d3b8, p =
> 0x321d3b8}, sending_list = {n = 0x3174cf8, p = 0x3174cf8}}
(...)

In fact it's exactly the same. I've analyzed the core you sent me (thanks
a lot for this by the way), and it explains very well why it's looping,
though I don't yet understand how we entered this situation. It also
explains how it can randomly crash instead of looping.

What happens is that the item you're inspecting above has an empty
list element which points to itself. So it doesn't belong to any list
and its next is itself, yet it is still attached to its predecessor. In
your core file this was very visible: the h2c list's head points to the
element, and the tail of the list points to it as well. So what we're
seeing is an element that was *not* unlinked from a list before being
removed / recycled. Once it was reused, it was re-initialized with an
empty list header, and the next time the connection's list was visited,
it caused the loop.
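
To make the failure mode concrete, here's a tiny standalone sketch of
it (plain C with made-up names, not HAProxy code): once a self-pointing
element is still reachable from a list head, a forward traversal can
never get back to the head, so the cursor spins on the same node
forever.

    #include <stdio.h>

    struct list { struct list *n, *p; };

    int main(void)
    {
        struct list head, stale;

        /* "stale" stands for the recycled h2s: freed while still
         * linked, then re-initialized as an empty list (n == p). */
        stale.n = stale.p = &stale;

        /* the connection's list head still points to it */
        head.n = &stale;
        head.p = &stale;

        /* a forward walk never reaches &head again because stale.n
         * is &stale; bound it here just to show the first few steps */
        int steps = 0;
        for (struct list *pos = head.n; pos != &head && steps < 5; pos = pos->n)
            printf("step %d: pos=%p\n", ++steps, (void *)pos);

        return 0;
    }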

What I find very strange is that you're possibly the only one seeing this
(and maybe also @serimin on github issue #94). If we could figure out
what makes your case specific, it could help narrow the issue down. I'm
seeing that you have a very simple Lua service to respond to health
checks, so I've been thinking that maybe we still have some remaining
bugs when Lua is accessed over H2 (e.g. incorrect length or whatever).
But that's not the case for your connection: it carries something like
17 streams, so we can rule out the hypothesis of a health check
connection, and thus that Lua was used at all.

I'm back to auditing the code to figure out how we can free an h2s
without first detaching it from the lists. I hope to have yet another
patch to propose to you.
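
For reference, the kind of defensive detach I have in mind looks
roughly like this (a minimal sketch reusing the toy struct from above,
not the actual patch): unlink the element from its neighbours and
re-initialize it before it can be recycled, so a later traversal never
picks up a stale node.

    struct list { struct list *n, *p; };

    /* Unlink e from whatever list it is on, then make it an empty,
     * self-contained list so a second unlink stays harmless. */
    static void list_del_init(struct list *e)
    {
        e->p->n = e->n;   /* predecessor now skips the element */
        e->n->p = e->p;   /* successor points back to the predecessor */
        e->n = e->p = e;  /* element now only points to itself */
    }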

Thanks again for your traces, they've been amazingly useful!
Willy
