On Thu, Dec 24, 2020 at 02:28:25PM +0100, Otto Moerbeek wrote:
> On Thu, Dec 24, 2020 at 01:59:34PM +0100, Mark Kettenis wrote:
>
> > > Date: Thu, 24 Dec 2020 12:27:01 +0100
> > > From: Otto Moerbeek <[email protected]>
> > >
> > > On Thu, Dec 24, 2020 at 10:36:37AM +0100, Otto Moerbeek wrote:
> > >
> > > > On Thu, Dec 24, 2020 at 08:25:45AM +0100, Uli Schlachter wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > Am 24.12.20 um 04:35 schrieb Alexey Sokolov:
> > > > > [...]
> > > > > > Hi, getaddrinfo() crashes when multiple threads run getaddrinfo()
> > > > > > concurrently. This didn't happen in 6.7.
> > > > > > It looks like asr_ctx which is supposed to be thread-local
> > > > > > according to
> > > > > > _asr_use_resolver(), is actually static / shared between threads.
> > > > > >> How-To-Repeat:
> > > > > [...]
> > > > > > The reproducing code written by psychon, CCed, while debugging the
> > > > > > crash
> > > > > > of a more complicated software
> > > > > [...]
> > > > > here is a bit more information about what is going on:
> > > > >
> > > > > When run under egdb, you can "watch _asr". This reports:
> > > > >
> > > > > Old value = (struct asr *) 0x0
> > > > > New value = (struct asr *) 0x79905dce2a0
> > > > > _asr_use_resolver (arg=<optimized out>) at
> > > > > /usr/src/lib/libc/asr/asr.c:360
> > > > >
> > > > > (gdb) bt
> > > > > #0 _asr_use_resolver (arg=<optimized out>) at
> > > > > /usr/src/lib/libc/asr/asr.c:360
> > > > > #1 0x000007990558862e in _libc_res_init () at
> > > > > /usr/src/lib/libc/asr/res_init.c:44
> > > > > #2 0x000007990552c1e4 in _libc_getaddrinfo (hostname=0x796629c37a3
> > > > > "ipv4.google.com", servname=0x0, hints=0x7f7ffffce120,
> > > > > res=0x7f7ffffce158) at /usr/src/lib/libc/asr/getaddrinfo.c:36
> > > > > #3 0x00000796629c4d2e in do_lookup ()
> > > > > #4 0x00000796629c4da3 in do_things ()
> > > > > #5 0x00000796629c4ded in main ()
> > > > >
> > > > > At this time, the program is still single-threaded (as seen in the
> > > > > backtrace). This line in _asr_use_resolver() is the culprit:
> > > > >
> > > > > priv = _THREAD_PRIVATE(_asr, _asr, &_asr);
> > > > >
> > > > > the macro is defined as:
> > > > >
> > > > > #define _THREAD_PRIVATE(keyname, storage, error)
> > > > > (_thread_cb.tc_tag_storage == NULL ? &(storage) :
> > > > > _thread_cb.tc_tag_storage(&(__THREAD_NAME(keyname)),
> > > > > &(storage), sizeof(storage), error))
> > > > >
> > > > > gdb has to say about this:
> > > > >
> > > > > (gdb) print _thread_cb.tc_tag_storage
> > > > > $3 = (void *(*)(void **, void *, size_t, void *)) 0x0
> > > > >
> > > > > Thus, because we are still single-threaded, the program uses the
> > > > > static
> > > > > variable _asr directly to store its "per thread" resolver (priv =
> > > > > &_asr
> > > > > in the macro invocation above and thus *priv = _asr_resolver() ends up
> > > > > storing the resolver in a global variable).
> > > > >
> > > > > Later, threads are started. The next time that this code runs,
> > > > > _thread_cb.tc_tag_storage points to _thread_tag_storage() from
> > > > > lib/libc/thread/rthread_libc.c. This function uses memcpy() to
> > > > > initialise the per-thread variant of the variable with the contents of
> > > > > the global variant. Thus, all threads now end up using the resolver
> > > > > instance that the main thread created.
> > > > >
> > > > > I did not investigate what exactly goes wrong later, but since
> > > > > multiple
> > > > > threads are now sharing a resolver, nothing good can happen.
> > > > >
> > > > > The actual crash is a use-after-free (and it is always the same one):
> > > > > iter_family() in lib/libc/asr/getaddrinfo_async.c executes
> > > > > AS_FAMILY(as). This is:
> > > > >
> > > > > #define AS_FAMILY(p) ((p)->as_ctx->ac_family[(p)->as_family_idx])
> > > > >
> > > > > (gdb) print as->as_ctx->ac_family[as->as_family_idx]
> > > > > Cannot access memory at address 0x28b3653e000
> > > > >
> > > > > The struct as->as_ctx was overwritten with 0xdf. Here is a random
> > > > > struct
> > > > > member as an example:
> > > > >
> > > > > (gdb) print as->as_ctx[0]->ac_domain
> > > > > $8 = 0xdfdfdfdfdfdfdfdf <error: Cannot access memory at address
> > > > > 0xdfdfdfdfdfdfdfdf>
> > > > >
> > > > > At this point, as->as_ctx == _asr->a_ctx, but I am not sure how much
> > > > > this helps / surprises, since the crash happens in the main thread...
> > > > > I guess one of the threads freed the main thread's resolver in
> > > > > _asr_resolver_done()..?
> > > > >
> > > > > I guess that _asr_resolver_done() is supposed to set _asr back to a
> > > > > NULL
> > > > > pointer. However, a breakpoint on pthread_create() says that it is not
> > > > > doing so. In fact, I cannot find any callers of
> > > > > _asr_resolver_done()...?
> > > > > Perhaps some code was changed to directly call _asr_ctx_unref()
> > > > > instead?
> > > > > At least the code looks like asr_run() does _asr_async_free() at the
> > > > > end
> > > > > and this just calls _asr_ctx_unref() directly, even though the context
> > > > > was originally aquired by calling getaddrinfo_async() with last
> > > > > parameter NULL, which it passed on to _asr_use_resolver(NULL) and thus
> > > > > acquiring the thread-local context.
> > > > >
> > > > > Cheers,
> > > > > Uli
> > > > > --
> > > > > This can be a, a little complicated. Listen, my advice is... ask
> > > > > somebody else for advice, at least someone who's... got more
> > > > > experience
> > > > > at... giving advice.
> > > > >
> > > >
> > > > Thanks for this analysis, it's consistent with my observations I posted
> > > > earlier.
> > > >
> > > > -Otto
> > > >
> > >
> > > This works for me, but only lighlty tested. If this is an acceptable
> > > approach, some of the _rthread_init() machinery could likely be
> > > simplified.
> > >
> > > There might be pitfalls, though.
> >
> > Yes, I don't think that is a good idea.
> >
> > The problem isn't really the late initialization, but rather that the
> > ASR code doesn't use the _THREAD_PRIVATE() construction properly.
> >
> > We could work around that issue in the ASR code by doing something like:
> >
> > priv = _THREAD_PRIVATE(_asr, _asr, &_asr);
> > if (priv != &_asr && *priv == _asr)
> > *priv = NULL;
> > if (*priv == NULL) {
> > DPRINT("setting up thread-local resolver\n");
> > *priv = _asr_resolver();
> > }
> >
> > But maybe a better solution is to change _thread_tag_storage() to only
> > do the memcpy() if we're in the main thread and do a memset() if not.
>
> That would mean a change in the contract of _thread_tag_storage(). Now
> it is an internal function to libc luckily, so we are able to check if
> any caller depends on the current semantics.
>
> -Otto
>
The diff below attempt to follow the suggestion, but the test program
still crashes. I must be overlookign something but I think I won't be
able to spend much time on this the coming days.
-Otto
Index: rthread_libc.c
===================================================================
RCS file: /cvs/src/lib/libc/thread/rthread_libc.c,v
retrieving revision 1.3
diff -u -p -r1.3 rthread_libc.c
--- rthread_libc.c 10 Jan 2019 18:45:33 -0000 1.3
+++ rthread_libc.c 24 Dec 2020 14:46:20 -0000
@@ -5,6 +5,7 @@
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
+#include <tib.h>
#include "rthread.h"
#include "rthread_cb.h"
@@ -103,13 +104,16 @@ _thread_tag_storage(void **tag, void *st
ret = pthread_getspecific(tt->k);
if (ret == NULL) {
- ret = malloc(sz);
+ ret = calloc(1, sz);
if (ret == NULL)
ret = err;
else {
- if (pthread_setspecific(tt->k, ret) == 0)
- memcpy(ret, storage, sz);
- else {
+ if (pthread_setspecific(tt->k, ret) == 0) {
+ if (!__isthreaded ||
+ (TIB_GET()->tib_thread_flags &
+ TIB_THREAD_INITIAL_STACK))
+ memcpy(ret, storage, sz);
+ } else {
free(ret);
ret = err;
}