On Tue, Oct 27, 2015 at 3:41 AM, Ismail Pazarbasi <[email protected]> wrote: >>>> Hi, >>>> >>>> I've submitted some of the small patches to get TSan running on Mac OS >>>> X. Time for the real challenge is approaching, and that needs some >>>> discussion. >>>> >>>> The biggest problem with TSan on Darwin is the way thread local >>>> variables are created and destroyed. TSan uses a thread-local variable >>>> called "cur_thread_placeholder" to store each thread's state. Thread >>>> locals are initialized lazily on Darwin. >>>> >>>> Darwin dynamic linker uses pthread keys to maintain lifetime and >>>> storage for thread local variables. pthread keys are destroyed in the >>>> order of their construction, not the opposite direction; from lowest >>>> index to highest index. Per POSIX standard, this order is unspecified, >>>> but dyld code (and comments) explain the order. We want TSan's >>>> thread-locals to live until the end of execution, because we'll be >>>> accessing "cur_thread_placeholder" when thread exits. However, since >>>> TSan is initialized too early, it's taken down earlier than other >>>> thread locals. This is problematic, because destructor of these >>>> late-destroyed thread-locals may still call functions in TSan RTL >>>> (interposed functions or functions called from instrumented code, e.g. >>>> __tsan_func_entry). That either immediately crashes, because TSan is >>>> in invalid state or tries to "resurrect" TLV, which generally creates >>>> an infinite recursion, and end up with stack overflow. >>>> >>>> As far as I understood, dynamic linker creates a pthread key acting >>>> like a class "destructor" (quoting to distinguish it from pthread's >>>> destructor). When dyld loads image, it creates another key, whose >>>> destructor function frees the storage. At thread exit, it will call >>>> destructor functions from first to last. Therefore, the first >>>> "destructor" destructor function will be called before "finalize" >>>> destructor. After finalize, storage is freed, but can be resurrected >>>> if address of TLV is taken. >>>> >>>> I have found a workaround (or a solution -- you name it) that aims to >>>> keep early initialization behavior of TSan TLV keys, but postpone >>>> their destruction as much as possible. It seems like the only way to >>>> do this is to assign highest pthread key indices to dyld's TLV code >>>> when it allocates thread-local variables for TSan, and preserve their >>>> order. Because dyld will create keys in the right order; we just want >>>> to move those keys to the end of allocatable range. We have N pthread >>>> keys available in the range, and we need to allocate three keys at the >>>> highest possible indices, and order them. Two of these keys are used >>>> by TLV code (in dyld), and one is used internally in TSan. In this >>>> case, key N-2 is tlv_init, N-1 is finalizer, and N-3 is internal-use. >>>> Destructor functions for these keys will be called in ascending order, >>>> and TSan will work as expected in this case. >>>> >>>> TSan wasn't interposing pthread_key_create, but I had to highjack it >>>> on Darwin to achieve the behavior I explained. First, it identifies >>>> caller. If it's one of dyld's TLV (Thread Local Variable) functions, I >>>> call "pthread_key_create_high" to allocate highest possible key >>>> indices. It's basically a 'while' loop until real pthread_key_create >>>> returns EAGAIN. It then frees all other keys it has previously >>>> allocated. This places TSan's TLV keys to the end of destructor list. >>>> There is one more thing this function does; it stores pointer to >>>> original destructor function (passed in to pthread_key_create), and >>>> replaces it with its own destructor function, which will call >>>> 'pthread_setspecific' on the key PTHREAD_DESTRUCTOR_ITERATIONS - 1 >>>> times to postpone destruction of TSan TLV even further (IIRC, this was >>>> necessary). The replacement destructor function then calls the actual >>>> destructor function. >>>> >>>> As a side note, dyld uses libc++, and its loading the instrumented >>>> version. This didn't seem to be creating any problem so far. I didn't >>>> test this on a sufficiently large code base just yet. TSan tests >>>> generally pass, but they didn't have this "late-destroyed thread >>>> locals" problem I mentioned. >>>> >>>> Does my approach sound wrong, flawed or too brittle? Is there any >>>> other way to order these events in a sensible way? Can you help me >>>> fixing some of the problems I mentioned below, especially the >>>> 'identify_caller' function? >>>> >>>> The code is in a very dirty state now, so I refrain submitting it. >>>> But, no worries; there's always pseudo code! I'd like to share the key >>>> creation part, the rest of TSan patches are much simpler than this >>>> part. >>>> >>>> Thanks, >>>> Ismail >>>> >>>> #if SANITIZER_MAC >>>> enum { kPthHighIndexTlvInitializer = 0, >>>> kPthHighIndexTlvLoadNotification, kPthHighIndexTsanKeyCtor, >>>> kPthHighIndexTsanFinalize, >>>> kPthHighIndexCount }; >>>> >>>> // FIXME: This function is problematic. Can we avoid dladdr? >>>> static int identify_caller(uptr pc) { >>>> if (pc == (uptr)&__tsan::TsanGetOrCreateFinalizeKey) >>>> return kPthHighIndexTsanFinalize; >>>> // There are 2 "hidden" callers >>>> Dl_info dli; >>>> if (!dladdr((const void*)pc, &dli)) >>>> return -1; >>>> else if (!internal_strncmp(dli.dli_sname, "tlv_initializer", 15)) >>>> return kPthHighIndexTlvInitializer; >>>> else if (!internal_strncmp(dli.dli_sname, "tlv_load_notification", 21)) >>>> return kPthHighIndexTlvLoadNotification; >>>> return -1; >>>> } >>>> >>>> TSAN_INTERCEPTOR(int, pthread_key_create, unsigned long *key, >>>> void (*dtor)(void*)) { >>>> if (!COMMON_INTERCEPTOR_NOTHING_IS_INITIALIZED) { >>>> SCOPED_INTERCEPTOR_RAW(pthread_key_create, key, dtor); >>>> return REAL(pthread_key_create)(key, dtor); >>>> } >>>> const uptr caller_pc = GET_CALLER_PC(); >>>> const int caller_index = identify_caller(caller_pc); >>>> return -1 == caller_index ? REAL(pthread_key_create)(key, dtor) >>>> : pthread_key_create_high(key, dtor, >>>> caller_index); >>>> } >>>> >>>> // 'index' represents caller index so that we allocate them >>>> // in an order. We want to get the last N-index keys >>>> // in a sensible order. >>>> int __tsan::pthread_key_create_high(unsigned long *key, >>>> void (*dtor)(void *), int index) { >>>> int res = 0; >>>> original_dtors[index] = dtor; >>>> // Can be a function-local ulong[512]? >>>> unsigned long lowest_key, highest_key; >>>> >>>> // The following 2 loops may terminate early or later >>>> // depending on 'index'. >>>> do >>>> res = REAL(pthread_key_create)(&highest_key, replacement_dtor); >>>> while (res != EAGAIN); >>>> for (; lowest_key < highest_key; ++lowest_key) { >>>> pthread_key_delete(lowest_key); >>>> } >>>> tlv_keys[index] = highest_key; >>>> return res; >>>> } >>>> >>>> // This will be called from _pthread_tsd_cleanup >>>> static void replacement_dtor(void *p) { >>>> uptr cur_key; >>>> unsigned index = kPthHighIndexCount; >>>> // Can this make it through code review? >>>> __asm__ __volatile__ ("" : "=b" (cur_key)); >>>> for (unsigned c = kPthHighIndexTlvInitializer; c < kPthHighIndexCount; >>>> c++) { >>>> if (cur_key == tlv_keys[c]) { >>>> index = c; >>>> break; >>>> } >>>> } >>>> if (--tls_dtor_counters[index] == 1) { >>>> if (original_dtors[index]) >>>> original_dtors[index](p); >>>> return; >>>> } >>>> pthread_setspecific(tlv_keys[index], p); >>>> } >>>> #endif >>> >>> >>> >>> Hi Ismail, >>> >>> I have a counter-proposal that I wanted to implement for some time, >>> but never actually get to it because it was low-priority. >>> The problem that we had is with that PTHREAD_DESTRUCTOR_ITERATIONS-1 >>> postponing. There is other smart code that does the same and that also >>> breaks the assumption that tsan thread finalization runs last. >>> Another problem was with asan on mac. Essentially what you see, but in >>> asan it is easier to work around. We just ignore all memory accesses >>> that come after our thread destructor. >>> >>> The idea that I had is relatively clean, reasonably >>> platform-independent and should solve _all_ problems of such kind once >>> and for all. >>> Namely: we execute asan/tsan/msan thread destructor when the thread >>> _join_ returns. At that point there is definitely no thread code >>> running. The high-level implementation plan: for joinable threads we >>> hook into join (we already intercept it); for detached threads we >>> start a helper background thread when thread function returns and join >>> the user thread from that helper thread. I suspect that actual >>> implementation will be messier than that: for example, a thread can >>> call pthread_detach in a pthread_specific destructor after we've >>> decided to _not_ start a helper thread; or a thread can call >>> pthread_exit and does not return from thread function. >>> >>> What do you think about this idea? >> >> Another implementation plan can be to always join all threads >> ourselves. Then in pthread_detach merely remember that the thread is >> detached, but don't call REAL(pthread_detach). Then when we join a >> thread check whether it is detached or not; if it is detached then we >> can discard all thread state; if it is non-detached then we need to >> remember that the thread has finished and its return value, wait for >> user pthread_join call and satisfy it from our state. >> >> I don't know which plan will lead to simpler and more reliable >> implementation. >> > I didn't think thoroughly about this case. This sounds easier to > implement, but I am not sure if it'll work correctly (i.e. will it > miss problems that happen after TSan TLV dies, or will it report > false-positives, because TSan TLV dies too early). Still, we need to
Can we somehow get access to ThreadState from cur_thread() until we join the thread? That would be ideal and prevent false positives and negatives. The simplest way I can think of is to consult ThreadRegistriy using pthread_self() as the key to find current ThreadState if pthread_getspecific() returns NULL. > change a few things in TSan so that it tolerates missing > cur_thread_state. I forgot many details of TSan, I haven't been > hacking it for a long time. > >> To make it clear, what I don't like in your proposal: it is somewhat >> platform-dependent, should be enabled only on mac (thus less tested) >> and does not solve all potential problems of such kind (e.g. on some >> other OS there can be some other magical way to run user code after >> tsan thread destructor). > I agree; it's best to avoid platform-specific workarounds. > > Ismail -- You received this message because you are subscribed to the Google Groups "address-sanitizer" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
