On Thu, Dec 24, 2020 at 10:36:37AM +0100, Otto Moerbeek wrote:
> On Thu, Dec 24, 2020 at 08:25:45AM +0100, Uli Schlachter wrote:
>
> > Hi everyone,
> >
> > Am 24.12.20 um 04:35 schrieb Alexey Sokolov:
> > [...]
> > > Hi, getaddrinfo() crashes when multiple threads run getaddrinfo()
> > > concurrently. This didn't happen in 6.7.
> > > It looks like asr_ctx which is supposed to be thread-local according to
> > > _asr_use_resolver(), is actually static / shared between threads.
> > >> How-To-Repeat:
> > [...]
> > > The reproducing code written by psychon, CCed, while debugging the crash
> > > of a more complicated software
> > [...]
> > here is a bit more information about what is going on:
> >
> > When run under egdb, you can "watch _asr". This reports:
> >
> > Old value = (struct asr *) 0x0
> > New value = (struct asr *) 0x79905dce2a0
> > _asr_use_resolver (arg=<optimized out>) at /usr/src/lib/libc/asr/asr.c:360
> >
> > (gdb) bt
> > #0 _asr_use_resolver (arg=<optimized out>) at
> > /usr/src/lib/libc/asr/asr.c:360
> > #1 0x000007990558862e in _libc_res_init () at
> > /usr/src/lib/libc/asr/res_init.c:44
> > #2 0x000007990552c1e4 in _libc_getaddrinfo (hostname=0x796629c37a3
> > "ipv4.google.com", servname=0x0, hints=0x7f7ffffce120,
> > res=0x7f7ffffce158) at /usr/src/lib/libc/asr/getaddrinfo.c:36
> > #3 0x00000796629c4d2e in do_lookup ()
> > #4 0x00000796629c4da3 in do_things ()
> > #5 0x00000796629c4ded in main ()
> >
> > At this time, the program is still single-threaded (as seen in the
> > backtrace). This line in _asr_use_resolver() is the culprit:
> >
> > priv = _THREAD_PRIVATE(_asr, _asr, &_asr);
> >
> > the macro is defined as:
> >
> > #define _THREAD_PRIVATE(keyname, storage, error)
> > (_thread_cb.tc_tag_storage == NULL ? &(storage) :
> > _thread_cb.tc_tag_storage(&(__THREAD_NAME(keyname)),
> > &(storage), sizeof(storage), error))
> >
> > gdb has to say about this:
> >
> > (gdb) print _thread_cb.tc_tag_storage
> > $3 = (void *(*)(void **, void *, size_t, void *)) 0x0
> >
> > Thus, because we are still single-threaded, the program uses the static
> > variable _asr directly to store its "per thread" resolver (priv = &_asr
> > in the macro invocation above and thus *priv = _asr_resolver() ends up
> > storing the resolver in a global variable).
> >
> > Later, threads are started. The next time that this code runs,
> > _thread_cb.tc_tag_storage points to _thread_tag_storage() from
> > lib/libc/thread/rthread_libc.c. This function uses memcpy() to
> > initialise the per-thread variant of the variable with the contents of
> > the global variant. Thus, all threads now end up using the resolver
> > instance that the main thread created.
> >
> > I did not investigate what exactly goes wrong later, but since multiple
> > threads are now sharing a resolver, nothing good can happen.
> >
> > The actual crash is a use-after-free (and it is always the same one):
> > iter_family() in lib/libc/asr/getaddrinfo_async.c executes
> > AS_FAMILY(as). This is:
> >
> > #define AS_FAMILY(p) ((p)->as_ctx->ac_family[(p)->as_family_idx])
> >
> > (gdb) print as->as_ctx->ac_family[as->as_family_idx]
> > Cannot access memory at address 0x28b3653e000
> >
> > The struct as->as_ctx was overwritten with 0xdf. Here is a random struct
> > member as an example:
> >
> > (gdb) print as->as_ctx[0]->ac_domain
> > $8 = 0xdfdfdfdfdfdfdfdf <error: Cannot access memory at address
> > 0xdfdfdfdfdfdfdfdf>
> >
> > At this point, as->as_ctx == _asr->a_ctx, but I am not sure how much
> > this helps / surprises, since the crash happens in the main thread...
> > I guess one of the threads freed the main thread's resolver in
> > _asr_resolver_done()..?
> >
> > I guess that _asr_resolver_done() is supposed to set _asr back to a NULL
> > pointer. However, a breakpoint on pthread_create() says that it is not
> > doing so. In fact, I cannot find any callers of _asr_resolver_done()...?
> > Perhaps some code was changed to directly call _asr_ctx_unref() instead?
> > At least the code looks like asr_run() does _asr_async_free() at the end
> > and this just calls _asr_ctx_unref() directly, even though the context
> > was originally aquired by calling getaddrinfo_async() with last
> > parameter NULL, which it passed on to _asr_use_resolver(NULL) and thus
> > acquiring the thread-local context.
> >
> > Cheers,
> > Uli
> > --
> > This can be a, a little complicated. Listen, my advice is... ask
> > somebody else for advice, at least someone who's... got more experience
> > at... giving advice.
> >
>
> Thanks for this analysis, it's consistent with my observations I posted
> earlier.
>
> -Otto
>
This works for me, but only lighlty tested. If this is an acceptable
approach, some of the _rthread_init() machinery could likely be
simplified.
There might be pitfalls, though.
-Otto
Index: rthread.c
===================================================================
RCS file: /cvs/src/lib/librthread/rthread.c,v
retrieving revision 1.99
diff -u -p -r1.99 rthread.c
--- rthread.c 4 Nov 2017 22:53:57 -0000 1.99
+++ rthread.c 24 Dec 2020 11:23:04 -0000
@@ -190,6 +190,12 @@ _thread_key_zero(int key)
}
}
+static void __attribute__ ((constructor)) rthread_ct(void);
+static void rthread_ct(void)
+{
+ _rthread_init();
+}
+
void
_rthread_init(void)
{