Re: (L) Prelim RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables

David Holmes Tue, 10 Nov 2015 03:27:35 -0800

Sorry the formatting of the replies is getting totally screwed up now :(


On 10/11/2015 8:20 PM, Thomas Stüfe wrote:

Hi David,

On Fri, Nov 6, 2015 at 11:20 PM, David Holmes <[email protected]> wrote:

    On 6/11/2015 9:52 PM, Thomas Stüfe wrote:

        Hi David,

        On Fri, Nov 6, 2015 at 7:26 AM, David Holmes <[email protected]> wrote:

             Hi Jeremy,

             Okay I have read:

             https://sourceware.org/glibc/wiki/TLSandSignals

             and the tree of mail postings referenced therefrom - great
             reading! :)

             So basic problem: access to __thread variables is not
             async-signal-safe.

             Exacerbator to problem: the first attempt to even read a
             __thread variable can lead to allocation, which is the real
             problem in relation to async-signal-safety.

             I mention the exacerbator because pthread_getspecific and
             pthread_setspecific are also not async-signal-safe but we
             already use them. However, pthread_getspecific is in fact (per
             the email threads linked above) effectively async-signal-safe,
             and further a call to pthread_getspecific never results in a
             call to pthread_setspecific or an allocation. Hence the pthread
             functions are almost, if not completely, safe in practice with
             reasonable uses (i.e. only read from a signal handler). Which
             explains this code in the existing Thread::current():

             #ifdef PARANOID
                // Signal handler should call ThreadLocalStorage::get_thread_slow()
                Thread* t = ThreadLocalStorage::get_thread_slow();
                assert(t != NULL && !t->is_inside_signal_handler(),
                       "Don't use Thread::current() inside signal handler");
             #endif

             So the problem scenario is: use of a __thread variable (that
             belongs to the shared library) in a signal handler.

             Solution 0: don't do that. Seriously - like any other
             async-signal-unsafe stuff we should not be using it in real
             signal handlers. The crash handler is a different matter - we
             try all sorts there because it might work and you can't die
             twice.

             Otherwise: narrow the window of exposure.

             1. We ensure we initialize thread_current (even with a dummy
             value) as early as possible in the thread that loads libjvm. As
             we have no signal handlers installed at that point that might
             use the same variable, we cannot hit the problem scenario.

             2. We ensure we initialize thread_current in a new thread with
             all signals blocked. This again avoids the problem scenario.

             3. We initialize thread_current in an attaching thread as soon
             as possible and we again first block all signals.
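
For illustration, steps 2 and 3 above might look roughly like the following. This is only a minimal sketch under stated assumptions, not the webrev code: thr_current, initialize_thread_current_blocked and current_thread are made-up names, Thread is a stand-in struct, and C++11 thread_local is used in place of __thread.

```cpp
#include <pthread.h>
#include <signal.h>

struct Thread { int id; };  // stand-in for the VM's Thread class

// Hypothetical per-thread slot (thread_local used in place of __thread).
static thread_local Thread* thr_current = nullptr;

// Touch the thread-local for the first time with every blockable signal
// masked, then restore the caller's mask. Returns 0 on success, otherwise
// the error code reported by pthread_sigmask.
int initialize_thread_current_blocked(Thread* t) {
  sigset_t all;
  sigset_t saved;
  sigfillset(&all);
  int rc = pthread_sigmask(SIG_SETMASK, &all, &saved);
  if (rc != 0) return rc;
  thr_current = t;  // first access: any lazy TLS allocation happens here
  return pthread_sigmask(SIG_SETMASK, &saved, nullptr);
}

Thread* current_thread() { return thr_current; }
```

A new or attaching thread would call initialize_thread_current_blocked() before doing anything else, so that a later read of the variable from a handler on that thread no longer triggers the first-touch allocation.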

             That still leaves the problem of an unattached native thread
             taking a signal whilst in async-signal-unsafe code, and
             executing a signal handler which in turn tries to access
             thread_current for the first time. This signal handler need
             not be an actual JVM handler, but one attached by other native
             code, e.g. an agent. I'm not clear in the latter case how
             reasonable it is for an agent's handler to try and do things
             from an unattached thread - and we don't claim any JNI
             interfaces can, or should, be called from a signal handler -
             but it is something you can probably get away with today.

             Let me also point out that we effectively have this code in
             Solaris already (at the ThreadLocalStorage class level). So if
             there is something here that will prevent the current proposal
             we already have a problem on Solaris. :(

             Thoughts/comments/suggestions?


        The first problem: a thread initializes a TLS variable, gets
        interrupted, and accesses the half-initialized variable from
        within the signal handler. This could happen today too, couldn't
        it? But I think we never saw this.


    That depends on the state of signal masks at the time of the
    initialization. For threads created in the VM and for threads
    attached to the VM it is likely not an issue. Unattached threads
    could in theory try to access a TLS variable from a signal handler,
    but they will never be initializing that variable. Of course the
    unattached thread could be initializing a completely distinct TLS
    variable, but reading a different TLS variable from the signal
    handler does not seem to be an issue (in theory it may be but this
    is an extreme corner case).

        In theory, it could be mitigated by some careful testing before
        using the Thread::current() value in the signal handler. Like, put
        an eyecatcher at the beginning of the Thread structure and check
        that using SafeFetch.


    There is no way to access the Thread structure before calling
    Thread::current(). And the potential problem is with unattached
    threads which have no Thread structure. For threads attached to the
    VM, or attaching, my three steps will deal with any potential problems.

        As for the second problem - recursive malloc() deadlocks - I am at
        a loss. I do not fully understand though why pthread_getspecific
        is different - does it not have to allocate space for the TLS
        variable too?


    No, pthread_getspecific does not have to allocate. Presumably it is
    written in a way that attempting to index a TLS variable that has
    not been allocated just returns an error (EINVAL?).


It would return NULL.
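
To make the read-only behaviour concrete, here is a minimal sketch of library-based TLS. The names thread_key, create_thread_key, lookup_thread and set_thread are hypothetical, not HotSpot's; only the pthread calls are real. POSIX specifies that pthread_getspecific returns NULL when the calling thread has no value associated with the key.

```cpp
#include <pthread.h>

// Hypothetical process-wide key, assumed to be created once at startup,
// before any signal handler that reads it can run.
static pthread_key_t thread_key;

// Returns 0 on success, otherwise an error number.
int create_thread_key() { return pthread_key_create(&thread_key, nullptr); }

// Read-only lookup: yields nullptr if this thread never stored a value.
void* lookup_thread() { return pthread_getspecific(thread_key); }

// Store a value for the calling thread. Returns 0 on success.
int set_thread(void* t) { return pthread_setspecific(thread_key, t); }
```

A handler running on an unattached thread would call only lookup_thread() and treat a null result as "not a VM thread".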

    The problem with __thread is that even a read will attempt to do the
    allocation - arguably (as the Google folk did argue) this is wrong,
    or else should be done in an async-safe way.


I looked up the implementation of pthread_getspecific and
pthread_setspecific in the glibc and now understand better why pthread
tls is considered safe here.

glibc allows for 1024 TLS slots, which are organized as a 32x32 sparse
array whose second-level arrays are only allocated when the first slot in
them is used by pthread_setspecific. So, only when writing the slot. It
also means that the number of allocation calls is smaller than with
__thread - instead of (presumably) calling malloc() for every instance
of __thread, it calls it at most 32 times. And the first 32 slots are
already allocated in the pthread structure, so they are free.
This means that even if one were to write to a TLS slot in the signal
handler, the chances of it mallocing are quite small.
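
The layout described above can be sketched as a two-level table in which reads never allocate and writes allocate a second-level block only on first use. This is an illustration of the idea only, not glibc's code: the class name SparseSlots, its members and the allocation counter are all assumptions made for the example.

```cpp
#include <array>
#include <cstddef>
#include <memory>

// Illustrative two-level slot table: 32 first-level entries, each pointing
// to a block of 32 slots. The first block is embedded in the object, much as
// the first 32 slots are said to live in the pthread structure.
class SparseSlots {
 public:
  static constexpr std::size_t kLevel1 = 32;
  static constexpr std::size_t kLevel2 = 32;

  SparseSlots() { level1_[0] = first_block_.data(); }
  SparseSlots(const SparseSlots&) = delete;
  SparseSlots& operator=(const SparseSlots&) = delete;

  // Read path: never allocates; a missing block simply yields nullptr.
  void* get(std::size_t key) const {
    if (key >= kLevel1 * kLevel2) return nullptr;
    void* const* block = level1_[key / kLevel2];
    return block != nullptr ? block[key % kLevel2] : nullptr;
  }

  // Write path: allocates the second-level block the first time it is needed.
  bool set(std::size_t key, void* value) {
    if (key >= kLevel1 * kLevel2) return false;
    const std::size_t i = key / kLevel2;
    if (level1_[i] == nullptr) {
      owned_[i] = std::make_unique<std::array<void*, kLevel2>>();  // zeroed
      level1_[i] = owned_[i]->data();
      ++allocations_;
    }
    level1_[i][key % kLevel2] = value;
    return true;
  }

  // Number of heap allocations performed so far by set().
  std::size_t allocations() const { return allocations_; }

 private:
  std::array<void*, kLevel2> first_block_{};
  std::array<void**, kLevel1> level1_{};
  std::array<std::unique_ptr<std::array<void*, kLevel2>>, kLevel1> owned_{};
  std::size_t allocations_ = 0;
};
```

With this shape, get() on any key is allocation-free, and set() on a key below 32 is too, which matches the point that a read from a signal handler cannot end up in malloc().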

__thread will only need to malloc (in the context we are discussing)
when a pre-existing thread first references a TLS variable from a lazily
loaded DSO. Otherwise the space for the DSO's TLS variables is
allocated "statically" when the thread control structures are created.


    This does leave me wondering exactly what effect the:

    static __thread Thread* _thr_current = NULL;

    has in terms of any per-thread allocation. ??

    Anyway to reiterate the problem scenario:
    - VM has been loaded in a process and signal handlers have been
    installed (maybe VM, maybe agent)
    - unattached thread is doing a malloc when it takes a signal
    - signal handler tries to read __thread variable and we get a malloc
    deadlock

    As I said I need to determine what signal handlers in the VM might
    ever run on an unattached thread, and what they might do.


I don't understand - our signal handler is globally active, no? So any
unattached thread may execute our signal handler at any time, and the
first thing our signal handler does is Thread::current(). If there was a
third party signal handler, it is getting called as chained handler, but
only after our signal handler ran.

The current code uses ThreadLocalStorage::get_thread_slow() in the
signal handler, which uses pthread_getspecific, which is "safe". The new
code would access the __thread variable and have to malloc - which is
unsafe.

Using the JDK launchers there are no unattached threads created before
libjvm is loaded - so the problem would never arise.

I have been looking hard at our signal handlers though and found they
don't seem to match how they are described in various parts of the code
that set the signal masks. My main concern is with process-directed
signals (SIGQUIT, SIGTERM) that trigger specific actions (thread dump,
orderly shutdown). Synchronous signals are not an issue - if the
unattached thread triggers a segv while in malloc then the process is
doomed anyway and a deadlock is less problematic (not great but hardly
in the same league as deadlocking an active fully working VM). But I'm
still having trouble joining all the dots in this code and figuring out
how an unattached thread might react today. I'll continue untangling
this tomorrow.

Thanks, Thomas

(My current feeling is that I'd prefer to keep the pthread TLS solution
but I like your simplifications to the code and would like to keep that
too...)

It was all the complex, inconsistent caching mechanisms employed over
the top of the library-based TLS that motivated the cleanup - especially
as the cache on Solaris was shown to be broken. If it was just a simple
library-based TLS layer, there would be less motivation to try the
__thread approach - but __thread had the appeal of removing a lot of
duplicated code. A simple library-based scheme might be an alternative
if it is performant enough - but not sure I have the time to go back to
square one on this.


Thanks,
David

    For a "third-party" signal handler there's really nothing I can do -
    they should not be accessing the VM's __thread variables though (and
    they can always introduce their own independent deadlocks by
    performing non-async-safe actions).

    Thanks,
    David

        Regards, Thomas


             Thanks,
             David


             On 6/11/2015 1:09 PM, David Holmes wrote:

                 Hi Jeremy,

                 I was going to ask you to elaborate :)

                 On 6/11/2015 12:24 PM, Jeremy Manson wrote:

                     I should probably elaborate on this.  With glibc + ELF,
                     the first time a thread accesses a variable declared
                     __thread, if that variable is in a shared library (as
                     opposed to the main executable), the system calls
                     malloc() to allocate the space for it.  If that happens
                     in a signal that is being delivered during a call to
                     malloc(), then you usually get a crash.


                 My understanding of the ELF ABI for thread-locals - which I
                 read about in the Solaris 11.1 Linkers and Libraries Guide -
                 does require use of the dynamic TLS model for any
                 dynamically loaded shared object which defines a
                 thread-local, but that is what we use as I understand it.
                 The docs state:

                 "A shared object containing only dynamic TLS can be loaded
                 following process startup without limitations. The runtime
                 linker extends the list of initialization records to
                 include the initialization template of the new object. The
                 new object is given an index of m = M + 1. The counter M is
                 incremented by 1. However, the allocation of new TLS blocks
                 is deferred until the blocks are actually referenced."

                 Now I guess "extends the list" might be implemented using
                 malloc ... but this will only occur in the main thread (the
                 one started by the launcher to load the JVM and become the
                 main thread), at the time libjvm is loaded - which will all
                 be over before any agent etc can run and do anything. But
                 "allocation ... is deferred" suggests we may have a problem
                 until either the first call to Thread::current or the call
                 to Thread::initialize_thread_current. If it is the former
                 then that should occur well before any agent etc can be
                 loaded. And I can easily inject an initial dummy call to
                 initialize_thread_current(null) to force the TLS allocation.

                     This may bite you if AsyncGetCallTrace uses
                     Thread::current(), and you use system timers to do
                     profiling.  If a thread is doing a malloc() prior to the
                     first time it accesses Thread::current(), and it gets
                     delivered a signal, it might die.  This is especially
                     likely for pure native threads started by native code.

                     I believe that this is a use case you support, so you
                     might want to make sure it is okay.


                 For a VM embedded in a process which already contains native
                 threads that will later attach to the VM, this may indeed be
                 a problem. One would have hoped however that the
                 implementation of TLS would be completely robust, at least
                 for something as simple as getting a signal whilst in the
                 allocator.

                 I'm unclear how to test for or check for this kind of
                 problem. Arguably there could be many things that are
                 async-unsafe in this way.

                 Need to think more about this and do some more research.
                 Would also appreciate any insight from any glibc and/or ELF
                 gurus.

                 Thanks.
                 David

                     Jeremy

                     On Thu, Nov 5, 2015 at 5:58 PM, Jeremy Manson
                     <[email protected]> wrote:

                          Something that's bitten me with __thread: it isn't
                          async-safe when called from a shared object on
                          Linux.  Have you vetted to make sure this doesn't
                          make HS less async-safe?

                          Jeremy

                          On Sun, Nov 1, 2015 at 10:40 PM, David Holmes
                          <[email protected]> wrote:

                              bug: https://bugs.openjdk.java.net/browse/JDK-8132510

                              Open webrev: http://cr.openjdk.java.net/~dholmes/8132510/webrev.v2/

                              A simple (in principle) but wide-ranging change
                              which should appeal to our Code Deletion
                              Engineers. We implement Thread::current() using
                              a compiler/language-based thread-local variable,
                              e.g.:


                                static __thread Thread *_thr_current;

                                inline Thread* Thread::current() {
                                  return _thr_current;
                                }

                              with an appropriate setter of course. By doing
                              this we can completely remove the
                              platform-specific ThreadLocalStorage
                              implementations, and the associated
                              os::thread_local_storage* calls, plus all the
                              uses of ThreadLocalStorage::thread() and
                              ThreadLocalStorage::get_thread_slow(). This
                              extends the previous work done on Solaris to
                              implement ThreadLocalStorage::thread() using
                              compiler-based thread-locals.
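
As a rough illustration of the getter plus "appropriate setter" idea, here is a minimal free-standing sketch. It is not the webrev code: Thread is a stand-in struct, thread_local replaces __thread, and initialize_thread_current as a free function, clear_thread_current and current_in_new_thread are hypothetical helpers added only to show that each thread sees its own copy.

```cpp
#include <thread>

struct Thread { int id; };  // stand-in for the VM's Thread class

// One copy per thread; every thread starts out seeing nullptr.
static thread_local Thread* thr_current = nullptr;

inline Thread* current() { return thr_current; }

// Hypothetical setter: called once when a thread starts or attaches.
inline void initialize_thread_current(Thread* t) { thr_current = t; }

// Hypothetical counterpart: called when a thread exits or detaches.
inline void clear_thread_current() { thr_current = nullptr; }

// Returns what a freshly started thread observes before it calls the setter.
inline Thread* current_in_new_thread() {
  Thread sentinel{-1};
  Thread* seen = &sentinel;  // overwritten by the new thread below
  std::thread t([&seen] { seen = current(); });
  t.join();
  return seen;
}
```

Setting the value in one thread leaves it unset in every other thread, which is why each new or attaching thread needs its own call to the setter.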

                              We can also consolidate nearly all the os_cpu
                              versions of MacroAssembler::get_thread on x86
                              into one cpu-specific one (a special variant is
                              still needed for 32-bit Windows).

                              As a result of this change we have further
                              potential cleanups:
                              - all the src/os/<os>/vm/thread_<os>.inline.hpp
                              files are now completely empty and could also
                              be removed
                              - the MINIMIZE_RAM_USAGE define (which avoids
                              use of the linux sp-map "cache" on 32-bit) now
                              has no effect and so could be completely
                              removed from the build system

                              I plan to do the MINIMIZE_RAM_USAGE removal as
                              a follow-up CR, but could add the removal of
                              the "inline" files to this CR if people think
                              it worth removing them.

                              I have one missing piece on Aarch64 - I need to
                              change MacroAssembler::get_thread to simply
                              call Thread::current() as on other platforms,
                              but I don't know how to write that. I would
                              appreciate it if someone could give me the
                              right code for that.

                              I would also appreciate comments/testing by the
                              AIX and PPC64 folk as well.

                              A concern about memory leaks had previously
                              been raised, but experiments using simple C
                              code on Linux x86 and Solaris showed no issues.
                              Also note that Aarch64 already uses this kind
                              of thread-local.

                              Thanks,
                              David
