Re: (L) Prelim RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables

David Holmes Mon, 09 Nov 2015 16:01:16 -0800

On 9/11/2015 8:54 AM, David Holmes wrote:

On 7/11/2015 11:22 AM, Jeremy Manson wrote:



On Thu, Nov 5, 2015 at 11:48 PM, David Holmes <[email protected]
<mailto:[email protected]>> wrote:

    On 6/11/2015 4:55 PM, Jeremy Manson wrote:

        FWIW, Google tried to convince the glibc maintainers to make this
        async-safe, but they weren't biting:

        https://sourceware.org/ml/libc-alpha/2014-01/msg00033.html


    Yes I read all that. I wouldn't say they weren't biting, more of a
    disagreement on the right direction for the patch. glibc weren't
    prepared to take it directly as is, while google-folk didn't seem to
    think it worth their while to do whatever glibc folk wanted. The
    actual patch proposal just died out. :( Quite a pity.

        Most of the things you can do are going to be mitigation rather
        than a
        fix.  I did what you suggest to mitigate, and no one complained,
        until
        someone at Google started a sidecar C++ thread that did a
        boatload of
        malloc'ing.


    Yes all mitigation. :(

        My workaround was far sneakier, and fixes the problem entirely,
        but you
        probably won't want to do it for code hygiene reasons.  I
        declare the
        __thread variable in the java launcher itself, and then export a
        function that provides a pointer to it.  In practice, in glibc,
        if it is
        in the main executable, ELF is smart enough to declare it as
        part of the
        stack, and is therefore async-safe.


    But even that only works for our own launchers - not for embedded in
    the JVM. :(


Yup.  Fortunately, I can tell people at Google how to write launchers.

        Fortunately, while this is a fun thing to think about, I don't
think
        there are any async paths in the JVM that use Thread::current()
        (although I could be wrong - there is a comment in there saying
        not to
        call Thread::current() in a signal handler...).  I would check
        the call
        paths in AsyncGetCallTrace to make sure.


    So two things ...

    First, using Thread::current() in a signal context was disallowed,
    but the alternative was ThreadLocalStorage::get_thread_slow(). The
    former may not work in a signal context due to the caching
    mechanisms layered in on different platforms, while the latter used
    the platform TLS API which, even if not guaranteed, worked well
    enough in a signal context. With __thread we don't have even a
    pretend signal-safe alternative :(


Right.

    Second, AsyncGetCallTrace violates the first rule by using
    JavaThread::current() in an assertion.


While we're on the subject, the assertion in Method::bci_from is
reachable from AsyncGetCallTrace and calls err_msg, which can malloc().
I meant to file a bug about that.

    Also, the problem may not be limited to something like
    AsyncGetCallTrace. Though agents may get the current thread from the
    JNIEnv rather than invoking some JVM function that results in
    Thread::current() being used, I can't be sure of that.


Which JVM functions that get the thread are supposed to be async-safe?
There is no reason to think that any method that isn't explicitly marked
async-safe is async safe, and most JNI methods I've tried to use from
signal handlers die painfully if I try to use them from a signal handler.

Generally, I don't think it is reasonable for a user to expect
async-safety from an API that isn't expressly designed that way.  POSIX
has a list of async-safe methods (signal(7)).


Right - no JNI or JVM TI functions are designated as async-signal-safe
(the specs dont even mention signals).

Unfortunately my problem is more basic: pretty much the first thing the
JVM signal handler does is get the current thread. So if the signal is
handled on an unattached thread that happened to be doing a malloc then
we're toast. :( Most of the signals the JVM expects to handle are not
blocked by default, AFAICS, so any unattached thread could be selected.

Just to keep my thinking straight on this, the problem only exists forthreads that existed before the JVM was loaded. All threads allocatedafter that will have space for all the TLS variables allocated directly.So the problem scenario is:


- external process with existing threads loads the JVM

- existing thread is executing critical library function eg malloc, whenit takes a process-directed signal.- JVM signal handler runs and accesses _thr_current which triggersdynamic TLS allocation


David
-----

David
-----

FWIW, to use AsyncGetCallTrace, I get the JNIEnv in a ThreadStart hook
from JVMTI and stash it in a __thread (and pull the trick I mentioned).
Jeremy


    Anyway more things to mull over on the weekedn. :)

    Thanks,
    David

        Jeremy

        On Thu, Nov 5, 2015 at 10:26 PM, David Holmes
        <[email protected] <mailto:[email protected]>
        <mailto:[email protected]
        <mailto:[email protected]>>> wrote:

             Hi Jeremy,

             Okay I have read:

        https://sourceware.org/glibc/wiki/TLSandSignals

             and the tree of mail postings referenced therefrom - great
        reading! :)

             So basic problem: access to __thread variables is not
        async-signal-safe

             Exacerbator to problem: first attempt to even read a
__thread
             variable can lead to allocation which is the real problem in
             relation to async-signal-safety

             I mention the exacerbator because pthread_getspecific and
             pthread_setSpecific are also not async-signal-safe but we
        already
             use them. However, pthread_getspecific is in fact (per
        email threads
             linked above) effectively async-signal-safe, and further a
        call to
             pthread_getspecific never results in a call to
        pthread_setspecific
             or an allocation. Hence the pthread functions are almost,
        if not
             completely, safe in practice with reasonable uses (ie only
        read from
             signal handler). Which explain this code in existing
        Thread::current()

             #ifdef PARANOID
                // Signal handler should call
        ThreadLocalStorage::get_thread_slow()
                Thread* t = ThreadLocalStorage::get_thread_slow();
                assert(t != NULL && !t->is_inside_signal_handler(),
                       "Don't use Thread::current() inside signal
handler");
             #endif

             So problem scenario is: use of __thread variable (that
        belongs to
             the shared-library) in a signal handler.

             Solution 0: don't do that. Seriously - like any other
             async-signal-unsafe stuff we should not be using it in real
        signal
             handlers. The crash handler is a different matter - we try
        all sorts
             there because it might work and you can't die twice.

             Otherwise: narrow the window of exposure.

             1. We ensure we initialize thread_current (even with a
        dummy value)
             as early as possible in the thread that loads libjvm. As we
        have no
             signal handlers installed at that point that might use
the same
             variable, we can not hit the problem scenario.

             2. We ensure we initialize thread_current in a new thread
        with all
             signals blocked. This again avoids the problem scenario.

             3. We initialize thread_current in an attaching thread as
        soon as
             possible and we again first block all signals.

             That still leaves the problem of an unattached native
        thread taking
             a signal whilst in async-signal-unsafe code, and executing
        a signal
             handler which in turns tries to access thread_current for
        the first
             time. This signal handler need not be an actual JVM
        handler, but one
             attached by other native code eg an agent. I'm not clear
in the
             latter case how reasonable it is for an agent's handler to
        try and
             do things from an unattached thread - and we don't claim
        any JNI
             interfaces can, or should, be called from a signal handler
        - but it
             is something you can probably get away with today.

             Let me also point out that we already effectively have this
        code in
             Solaris already (at the ThreadLocalStorage class level). So
        if there
             is something here that will prevent the current proposal we
        already
             have a problem on Solaris. :(

             Thoughts/comments/suggestions?

             Thanks,
             David


             On 6/11/2015 1:09 PM, David Holmes wrote:

                 Hi Jeremy,

                 I was going to ask you to elaborate :)

                 On 6/11/2015 12:24 PM, Jeremy Manson wrote:

                     I should probably elaborate on this.  With glibc +
        ELF, the
                     first time a
                     thread accesses a variable declared __thread, if
that
                     variable is in a
                     shared library (as opposed to the main
executable), the
                     system calls
                     malloc() to allocate the space for it.  If that
        happens in a
                     signal that
                     is being delivered during a call to malloc(),
then you
                     usually get a
                     crash.


                 My understanding of the ELF ABI for thread-locals -
        which I read
                 about
                 in the Solaris 11.1 Linkers and libraries guide - does
        require
                 use of
                 the dynamic TLS model for any dynamically loaded shared
        object which
                 defines a thread-local, but that is what we use as I
        understand
                 it. The
                 docs state:

                 "A shared object containing only dynamic TLS can be
        loaded following
                 process startup without limitations. The runtime linker
        extends
                 the list
                 of initialization records to include the initialization
        template
                 of the
                 new object. The new object is given an index of m = M +
        1. The
                 counter M is incremented by 1. However, the allocation
        of new
                 TLS blocks
                 is deferred until the blocks are actually referenced."

                 Now I guess "extends the list" might be implemented
        using malloc
                 ... but
                 this will only occur in the main thread (the one
        started by the
                 launcher
                 to load the JVM and become the main thread), at the
        time libjvm is
                 loaded - which will all be over before any agent etc
        can run and do
                 anything. But "allocation ... is deferred" suggests we
        may have a
                 problem until either the first call to Thread::current
        or the
                 call to
                 Thread::initialize_thread_current. If it is the former
        then that
                 should
                 occur well before any agent etc can be loaded. And I
        can easily
                 inject
                 an initial dummy call to
initialize_thread_current(null) to
                 force the
                 TLS allocation.

                     This may bite you if AsyncGetCallTrace uses
                     Thread::current(), and you
                     use system timers to do profiling.  If a thread
doing a
                     malloc() prior
                     to the first time it accesses Thread::current(),
        and it gets
                     delivered a
                     signal, it might die.  This is especially likely
        for pure
                     native threads
                     started by native code.

                     I believe that this is a use case you support, so
        you might
                     want to make
                     sure it is okay.


                 For a VM embedded in a process, which already contains
        native
                 threads,
                 that will later attach to the VM, this may indeed be a
        problem. One
                 would have hoped however that the implementation of TLS
        would be
                 completely robust, at least for something as simple as
        getting a
                 signal
                 whilst in the allocator.

                 I'm unclear how to test for or check for this kind of
        problem.
                 Arguably
                 there could be many things that are async-unsafe in
        this way.

                 Need to think more about this and do some more
        research. Would also
                 appreciate any insight from any glibc and/or ELF gurus.

                 Thanks.
                 David

                     Jeremy

                     On Thu, Nov 5, 2015 at 5:58 PM, Jeremy Manson
                     <[email protected]
        <mailto:[email protected]> <mailto:[email protected]
        <mailto:[email protected]>>
                     <mailto:[email protected]
        <mailto:[email protected]>
                     <mailto:[email protected]
        <mailto:[email protected]>>>> wrote:

                          Something that's bitten me with __thread: it
isn't
                     async-safe when
                          called from a shared object on Linux.  Have
        you vetted
                     to make sure
                          this doesn't make HS less async-safe?

                          Jeremy

                          On Sun, Nov 1, 2015 at 10:40 PM, David Holmes
                          <[email protected]
        <mailto:[email protected]>
                     <mailto:[email protected]
        <mailto:[email protected]>>
                     <mailto:[email protected]
        <mailto:[email protected]>

                     <mailto:[email protected]
        <mailto:[email protected]>>>> wrote:

                              bug:
        https://bugs.openjdk.java.net/browse/JDK-8132510

                              Open webrev:
        http://cr.openjdk.java.net/~dholmes/8132510/webrev.v2/

                              A simple (in principle) but wide-ranging
        change
                     which should
                              appeal to our Code Deletion Engineer's. We
        implement
                              Thread::current() using a
        compiler/language-based
                     thread-local
                              variable eg:


                                static __thread Thread *_thr_current;

                                inline Thread* Thread::current() {
                                  return _thr_current;
                                }

                              with an appropriate setter of course. By
        doing this
                     we can
                              completely remove the platform-specific
                     ThreadLocalStorage
                              implementations, and the associated
                     os::thread_local_storage*
                              calls, plus all the uses of
                     ThreadLocalStorage::thread() and
                              ThreadLocalStorage::get_thread_slow().
        This extends
                     the previous
                              work done on Solaris to implement
                     ThreadLocalStorage::thread()
                              using compiler-based thread-locals.

                              We can also consolidate nearly all the
os_cpu
                     versions of
                              MacroAssembler::get_thread on x86 into
one cpu
                     specific one ( a
                              special variant is still needed for 32-bit
        Windows).

                              As a result of this change we have further
                     potential cleanups:
                              - all the
        src/os/<os>/vm/thread_<os>.inline.hpp
                     files are now
                              completely empty and could also be removed
                              - the MINIMIZE_RAM_USAGE define (which
        avoids use
                     of the linux
                              sp-map "cache" on 32-bit) now has no
        affect and so
                     could be
                              completely removed from the build system

                              I plan to do the MINIMIZE_RAM_USAGE
        removal as a
                     follow up CR,
                              but could add the removal of the "inline"
        files to
                     this CR if
                              people think it worth removing them.

                              I have one missing piece on Aarch64 - I
        need to change
                              MacroAssembler::get_thread to simply call
                     Thread::current() as
                              on other platforms, but I don't know how
        to write
                     that. I would
                              appreciate it if someone could give me the
        right
                     code for that.

                              I would also appreciate comments/testing
        by the AIX
                     and PPC64
                              folk as well.

                              A concern about memory-leaks had
        previously been
                     raised, but
                              experiments using simple C code on linux
        86 and
                     Solaris showed
                              no issues. Also note that Aarch64 already
        uses this
                     kind of
                              thread-local.

                              Thanks,
                              David

Re: (L) Prelim RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables

Reply via email to