On 05/09/2013 04:59 PM, Laurent Bourgès wrote:
Hi all,

A stupid question:
should any ThreadLocal subclass be marked @Contended, to make sure that false sharing never happens between the ThreadLocal instance and any other object on the heap?


Hi Laurent,

A ThreadLocal object is just a key (into a ThreadLocalMap). It is usually subclassed not to add any state but to override the initialValue method. ThreadLocal contains a single final field, 'threadLocalHashCode', which is read on every call to ThreadLocal.get() (usually by multiple threads). That read can indeed contend with frequent writes to a field of some other object placed in its proximity, but I don't think we should put @Contended on every class that has frequently read fields. @Contended should be reserved for classes with fields that are frequently written, if I understand the concept correctly.
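
For illustration, a minimal sketch of the distinction (hypothetical class names; on JDK 8 the annotation is sun.misc.Contended and needs -XX:-RestrictContended for classes outside the boot class path):

    import sun.misc.Contended;

    // Typical ThreadLocal subclass: it only overrides initialValue() and adds
    // no mutable state, so there is nothing frequently written to isolate.
    class BufferLocal extends ThreadLocal<byte[]> {
        @Override protected byte[] initialValue() { return new byte[64]; }
    }

    // The kind of field @Contended is meant for: one that is written on nearly
    // every operation, so padding it away from its neighbours actually pays off.
    class Stats {
        @Contended
        volatile long hotCounter;
    }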

Regards, Peter

Laurent

2013/5/9 Peter Levart <peter.lev...@gmail.com>

    Hi Aleksey,

    Wouldn't it be even better if just the threadLocalRandom* fields were
    annotated with @Contended("ThreadLocal")?
    Some fields within the Thread object are accessed from non-local
    threads. I don't know how frequently, but isolating just the
    threadLocalRandom* fields from all possible false-sharing
    scenarios would be preferable, no?
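
    A minimal sketch of that idea (the class is a hypothetical stand-in and
    the group tag is arbitrary; the field names mirror the real Thread fields):

        import sun.misc.Contended;

        // Not the actual Thread.java patch: only the TLR fields carry the
        // annotation. Fields sharing one @Contended group are laid out
        // together and padded away from the rest of the instance, instead
        // of padding the whole object.
        class ThreadLike {
            @Contended("ThreadLocal") long threadLocalRandomSeed;
            @Contended("ThreadLocal") int  threadLocalRandomProbe;
            @Contended("ThreadLocal") int  threadLocalRandomSecondarySeed;

            // the remaining fields keep the default layout
            volatile Object parkBlocker;
        }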

    Regards, Peter


    On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:

        Hi,

        This is from our backlog after JDK-8005926. After the
        ThreadLocalRandom state was merged into Thread, we now have to
        deal with the false sharing induced by the heavily-updated fields
        in Thread. TLR was padded before, and it makes sense to have
        Thread bear the @Contended annotation to isolate its fields in
        the same manner.
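
        In shape, the change is simply a class-level annotation (a sketch
        of the effect on a hypothetical stand-in class, not the actual
        Thread.java diff; see the webrev below for the real patch):

            import sun.misc.Contended;

            // A class-level @Contended pads the whole instance, isolating all
            // of its fields from neighbouring objects. For java.lang.Thread
            // itself the annotation is honoured without -XX:-RestrictContended,
            // since Thread is loaded by the boot class loader.
            @Contended
            class PaddedHolder {
                long threadLocalRandomSeed;
                int  threadLocalRandomProbe;
                int  threadLocalRandomSecondarySeed;
            }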

        The webrev is here:
        http://cr.openjdk.java.net/~shade/8014233/webrev.00/

        Testing:
          - microbenchmarks (see below)
          - JPRT cycle against jdk8-tl

        The extended rationale for the change follows.

        If we look at the current Thread layout, we can see that the TLR
        state is buried within the Thread instance. The TLR state fields
        are by far the most frequently updated fields in Thread now:

            Running 64-bit HotSpot VM.
            Using compressed references with 3-bit shift.
            Objects are 8 bytes aligned.

            java.lang.Thread
              offset  size                     type description
                   0    12                          (assumed to be the object header + first field alignment)
                  12     4                      int Thread.priority
                  16     8                     long Thread.eetop
                  24     8                     long Thread.stackSize
                  32     8                     long Thread.nativeParkEventPointer
                  40     8                     long Thread.tid
                  48     8                     long Thread.threadLocalRandomSeed
                  56     4                      int Thread.threadStatus
                  60     4                      int Thread.threadLocalRandomProbe
                  64     4                      int Thread.threadLocalRandomSecondarySeed
                  68     1                  boolean Thread.single_step
                  69     1                  boolean Thread.daemon
                  70     1                  boolean Thread.stillborn
                  71     1  (alignment/padding gap)
                  72     4                   char[] Thread.name
                  76     4                   Thread Thread.threadQ
                  80     4                 Runnable Thread.target
                  84     4              ThreadGroup Thread.group
                  88     4              ClassLoader Thread.contextClassLoader
                  92     4     AccessControlContext Thread.inheritedAccessControlContext
                  96     4           ThreadLocalMap Thread.threadLocals
                 100     4           ThreadLocalMap Thread.inheritableThreadLocals
                 104     4                   Object Thread.parkBlocker
                 108     4            Interruptible Thread.blocker
                 112     4                   Object Thread.blockerLock
                 116     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
                 120                                (object boundary, size estimate)
              VM reports 120 bytes per instance
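
        (A dump like this can be reproduced with the JOL tool; a minimal
        sketch, assuming the current org.openjdk.jol coordinates:)

            import org.openjdk.jol.info.ClassLayout;
            import org.openjdk.jol.vm.VM;

            public class ThreadLayoutDump {
                public static void main(String[] args) {
                    // VM mode, compressed references, object alignment
                    System.out.println(VM.current().details());
                    // per-field offsets, sizes and padding gaps
                    System.out.println(ClassLayout.parseClass(Thread.class).toPrintable());
                }
            }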


        Assuming current x86 hardware with 64-byte cache lines and the
        current class layout, we can see that the trailing fields in
        Thread provide enough insulation from false sharing with an
        adjacent object. Also, the Thread instance itself is large enough
        that the TLR states of two different threads will not collide.

        However, the leading fields are not enough: a few words in the
        same cache line can belong to another object. This is where
        things can get worse in two ways: a) a TLR update can make field
        accesses in the adjacent object considerably slower; and, much
        worse, b) an update of the adjacent field can disturb the TLR
        state, which is critical for j.u.concurrent performance that
        relies heavily on fast TLR.

        To illustrate both points, there is a simple benchmark driven
        by JMH
        (http://openjdk.java.net/projects/code-tools/jmh/):
        http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
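
        In outline, the benchmark measures two cases (a sketch in current
        JMH syntax; the threadbench.zip above is the actual code used):

            import java.util.concurrent.ThreadLocalRandom;
            import org.openjdk.jmh.annotations.Benchmark;

            public class ThreadBench {

                @Benchmark
                public int tlrOnly() {
                    // hammers only the TLR state stored in the current Thread
                    return ThreadLocalRandom.current().nextInt();
                }

                @Benchmark
                public int tlrPlusHandler() {
                    // also reads the trailing Thread field, which may share a
                    // cache line with another thread's TLR state
                    Thread t = Thread.currentThread();
                    return ThreadLocalRandom.current().nextInt()
                            ^ System.identityHashCode(t.getUncaughtExceptionHandler());
                }
            }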

        On my 2x2 i5-2520M Linux x86_64 laptop, running the latest
        jdk8-tl with Thread built with and without @Contended, that
        microbenchmark yields the following results [20x1 sec warmup,
        20x1 sec measurements, 10 forks]:

        Accessing ThreadLocalRandom.current().nextInt():
           baseline:    932 +-  4 ops/usec
           @Contended:  927 +- 10 ops/usec

        Accessing TLR.current.nextInt() *and* Thread.getUEHandler():
           baseline:    454 +-  2 ops/usec
           @Contended:  490 +-  3 ops/usec

        One might note that $uncaughtExceptionHandler is the trailing
        field in Thread, so it can naturally be false-shared with the
        adjacent thread's TLR. We chose it as the illustration; in real
        applications, with a multitude of objects on the heap, some other
        object can become the contender.

        So that is a ~10% performance hit from false sharing, even on a
        very small machine. Translating it back: a heavily-updated field
        in an object adjacent to a Thread can bring the same overhead to
        TLR, and thereby jeopardize j.u.c performance.

        Of course, as soon as the status quo in field layout changes, we
        might start to lose spectacularly. I would recommend we deal with
        this now, so there are fewer surprises in the future.

        The caveat is that we waste some space per Thread instance. After
        the patch, the layout is:

            java.lang.Thread
              offset  size                     type description
                   0    12                          (assumed to be the object header + first field alignment)
                  12   128  (alignment/padding gap)
                 140     4                      int Thread.priority
                 144     8                     long Thread.eetop
                 152     8                     long Thread.stackSize
                 160     8                     long Thread.nativeParkEventPointer
                 168     8                     long Thread.tid
                 176     8                     long Thread.threadLocalRandomSeed
                 184     4                      int Thread.threadStatus
                 188     4                      int Thread.threadLocalRandomProbe
                 192     4                      int Thread.threadLocalRandomSecondarySeed
                 196     1                  boolean Thread.single_step
                 197     1                  boolean Thread.daemon
                 198     1                  boolean Thread.stillborn
                 199     1  (alignment/padding gap)
                 200     4                   char[] Thread.name
                 204     4                   Thread Thread.threadQ
                 208     4                 Runnable Thread.target
                 212     4              ThreadGroup Thread.group
                 216     4              ClassLoader Thread.contextClassLoader
                 220     4     AccessControlContext Thread.inheritedAccessControlContext
                 224     4           ThreadLocalMap Thread.threadLocals
                 228     4           ThreadLocalMap Thread.inheritableThreadLocals
                 232     4                   Object Thread.parkBlocker
                 236     4            Interruptible Thread.blocker
                 240     4                   Object Thread.blockerLock
                 244     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
                 248                                (object boundary, size estimate)
              VM reports 376 bytes per instance

        ...and we have an additional 256 bytes per Thread (twice the
        -XX:ContendedPaddingWidth, actually). That seems negligible
        compared to the space taken in native memory for each thread,
        especially the stack areas.

        Thanks,
        Aleksey.



