Hi Aleksey

Well, the code change is easy enough to review :)

As to the effects ... no way to judge that: time and testing will tell.

David

On 9/05/2013 3:29 AM, Aleksey Shipilev wrote:
Hi,

This is from our backlog after JDK-8005926. After the ThreadLocalRandom
state was merged into Thread, we now have to deal with the false sharing
induced by the heavily-updated fields in Thread. TLR was padded before,
and it makes sense to have Thread bear the @Contended annotation to
isolate its fields in the same manner.

The webrev is here:
    http://cr.openjdk.java.net/~shade/8014233/webrev.00/
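
For reference, class-level @Contended placement looks roughly like the
sketch below. This is an illustration of the mechanism only, not the
actual patch (see the webrev for that); the class and field names here
are hypothetical stand-ins:

    // Class-level @Contended asks the VM to pad this class' whole field
    // block away from anything else on the heap, by
    // -XX:ContendedPaddingWidth (128 bytes by default) on each side.
    // Application classes also need -XX:-RestrictContended; bootclasspath
    // classes such as java.lang.Thread are trusted by default.
    @sun.misc.Contended
    class PaddedState {
        long seed;           // stands in for threadLocalRandomSeed
        int  probe;          // stands in for threadLocalRandomProbe
        int  secondarySeed;  // stands in for threadLocalRandomSecondarySeed
    }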

Testing:
  - microbenchmarks (see below)
  - JPRT cycle against jdk8-tl

The extended rationale for the change follows.

If we look at the current Thread layout, we can see the TLR state is
buried within the Thread instance. The TLR state fields are by far the
most frequently updated fields in Thread now:

Running 64-bit HotSpot VM.
Using compressed references with 3-bit shift.
Objects are 8 bytes aligned.

java.lang.Thread
   offset  size                     type description
        0    12                          (assumed to be the object header + first field alignment)
       12     4                      int Thread.priority
       16     8                     long Thread.eetop
       24     8                     long Thread.stackSize
       32     8                     long Thread.nativeParkEventPointer
       40     8                     long Thread.tid
       48     8                     long Thread.threadLocalRandomSeed
       56     4                      int Thread.threadStatus
       60     4                      int Thread.threadLocalRandomProbe
       64     4                      int Thread.threadLocalRandomSecondarySeed
       68     1                  boolean Thread.single_step
       69     1                  boolean Thread.daemon
       70     1                  boolean Thread.stillborn
       71     1                          (alignment/padding gap)
       72     4                   char[] Thread.name
       76     4                   Thread Thread.threadQ
       80     4                 Runnable Thread.target
       84     4              ThreadGroup Thread.group
       88     4              ClassLoader Thread.contextClassLoader
       92     4     AccessControlContext Thread.inheritedAccessControlContext
       96     4           ThreadLocalMap Thread.threadLocals
      100     4           ThreadLocalMap Thread.inheritableThreadLocals
      104     4                   Object Thread.parkBlocker
      108     4            Interruptible Thread.blocker
      112     4                   Object Thread.blockerLock
      116     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
      120                                (object boundary, size estimate)
  VM reports 120 bytes per instance
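
Such a dump can be reproduced with the JOL tool; here is a minimal sketch
assuming the current org.openjdk.jol:jol-core artifact, which may differ
from the exact tool/version used for the dumps in this mail:

    import org.openjdk.jol.info.ClassLayout;

    public class ThreadLayoutDump {
        public static void main(String[] args) {
            // Prints field offsets, sizes, and padding gaps for
            // java.lang.Thread as laid out by the running VM
            // (compressed references, alignment, etc.).
            System.out.println(ClassLayout.parseClass(Thread.class).toPrintable());
        }
    }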


Assuming current x86 hardware with 64-byte cache lines and the current
class layout, we can see the trailing fields in Thread provide enough
insulation against false sharing with an adjacent object. Also, Thread
itself is large enough that the TLR states belonging to two different
threads will not collide.

However, the leading fields are not enough: we have a few words which can
occupy the same cache line, but belong to another object. This is where
things can get worse in two ways: a) the TLR update can make field
accesses in the adjacent object considerably slower; and, much worse, b)
an update in an adjacent field can disturb the TLR state, which is
critical for j.u.concurrent performance relying heavily on fast TLR.

To illustrate both points, here is a simple benchmark driven by JMH
(http://openjdk.java.net/projects/code-tools/jmh/):
   http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
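
The zip is the authoritative source; below is a rough sketch of what the
two measurements look like with the current JMH API (the annotations have
changed since 2013). The class and method names here are made up:

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;

    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @Warmup(iterations = 20, time = 1)
    @Measurement(iterations = 20, time = 1)
    @Fork(10)
    @State(Scope.Thread)
    public class TLRBench {

        @Benchmark
        public int tlrOnly() {
            // "Accessing ThreadLocalRandom.current().nextInt()"
            return ThreadLocalRandom.current().nextInt();
        }

        @Benchmark
        public void tlrAndHandler(Blackhole bh) {
            // TLR update plus a read of the trailing Thread field that can
            // share a cache line with another object's hot fields.
            bh.consume(ThreadLocalRandom.current().nextInt());
            bh.consume(Thread.currentThread().getUncaughtExceptionHandler());
        }
    }

Run with several threads (e.g. -t 4) so that multiple Thread instances
and their TLR states are live on the heap at once.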

On my 2x2 i5-2520M Linux x86_64 laptop, running the latest jdk8-tl and
Thread with/without @Contended, that microbenchmark yields the following
results [20x1 sec warmup iterations, 20x1 sec measurement iterations, 10 forks]:

Accessing ThreadLocalRandom.current().nextInt():
   baseline:    932 +-  4 ops/usec
   @Contended:  927 +- 10 ops/usec

Accessing TLR.current().nextInt() *and* Thread.getUncaughtExceptionHandler():
   baseline:    454 +-  2 ops/usec
   @Contended:  490 +-  3 ops/usec

One might note that $uncaughtExceptionHandler is the trailing field in
Thread, so it can naturally be false-shared with an adjacent thread's
TLR. We chose it as the illustration; in real scenarios, with a multitude
of objects on the heap, some other object can become the contender.

So that is a ~10% performance hit from false sharing, even on a very
small machine. Translating it back: having a heavily-updated field in an
object adjacent to Thread can bring the same overhead to TLR, and then
jeopardize j.u.c performance.

Of course, as soon as the status quo in field layout changes, we might
start to lose spectacularly. I would recommend we deal with this now, so
fewer surprises come in the future.

The caveat is that we are wasting some space per Thread instance.
After the patch, the layout is:

java.lang.Thread
  offset  size                     type description
       0    12                          (assumed to be the object header + first field alignment)
      12   128                          (alignment/padding gap)
     140     4                      int Thread.priority
     144     8                     long Thread.eetop
     152     8                     long Thread.stackSize
     160     8                     long Thread.nativeParkEventPointer
     168     8                     long Thread.tid
     176     8                     long Thread.threadLocalRandomSeed
     184     4                      int Thread.threadStatus
     188     4                      int Thread.threadLocalRandomProbe
     192     4                      int Thread.threadLocalRandomSecondarySeed
     196     1                  boolean Thread.single_step
     197     1                  boolean Thread.daemon
     198     1                  boolean Thread.stillborn
     199     1                          (alignment/padding gap)
     200     4                   char[] Thread.name
     204     4                   Thread Thread.threadQ
     208     4                 Runnable Thread.target
     212     4              ThreadGroup Thread.group
     216     4              ClassLoader Thread.contextClassLoader
     220     4     AccessControlContext Thread.inheritedAccessControlContext
     224     4           ThreadLocalMap Thread.threadLocals
     228     4           ThreadLocalMap Thread.inheritableThreadLocals
     232     4                   Object Thread.parkBlocker
     236     4            Interruptible Thread.blocker
     240     4                   Object Thread.blockerLock
     244     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
     248                                (object boundary, size estimate)
VM reports 376 bytes per instance

...and we have an additional 256 bytes per Thread (376 - 120 = 256 bytes,
i.e. twice the default -XX:ContendedPaddingWidth of 128, since the
padding goes both before and after the field block). That seems
negligible compared to the space consumed in native memory for each
thread, especially the stack areas.

Thanks,
Aleksey.
