On 05/10/13 02:31, Laurent Bourgès wrote:
Peter,

you're absolutely right: I was thinking about thread-local values (object
instances), not ThreadLocal keys!
I think the name ThreadLocal is confusing, as it does not refer to the values.

I have wondered several times whether false sharing can happen between my
thread-local values (i.e. different Thread context classes) and any other
object (including other Thread contexts).

As Peter implied, this would in general be overkill. Every use
of @Contended should be an empirically guided time/space tradeoff.
There are specific classes used as ThreadLocals that may warrant this.
For example, java.util.concurrent.Exchanger has one.


Is the GC (old gen) able to place objects in a thread-dedicated area? That
would avoid any false sharing between the object graphs dedicated to each
thread, i.e. thread isolation.

No it doesn't. Some collectors use some heuristics that tend to
keep per-thread objects together, but there are no guarantees.

-Doug



I think that TLABs do so for allocation / short-lived objects, but for the
old generation (long-lived objects) that is not the case: maybe G1 could
provide a different partitioning and take thread affinity into account?

Laurent

2013/5/9 Peter Levart <peter.lev...@gmail.com>


On 05/09/2013 04:59 PM, Laurent Bourgès wrote:

Hi all,

A stupid question:
should any ThreadLocal subclass be marked @Contended to be sure that false
sharing never happens between a ThreadLocal instance and any other object on
the heap?


Hi Laurent,

A ThreadLocal object is just a key (into a ThreadLocalMap). It's usually not
subclassed to add any state, but to override the initialValue method.
ThreadLocal contains a single final field, 'threadLocalHashCode', which is
read at every call to ThreadLocal.get() (usually by multiple threads). This
can contend with a frequent write of a field in some other object placed
in its proximity, yes, but I don't think we should put @Contended on
every class that has frequently read fields. @Contended should be reserved
for classes with fields that are frequently written, if I understand the
concept correctly.
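
To make the key/value distinction concrete, here is a minimal sketch. The
class name ThreadLocalKeyDemo, the CONTEXT field and the runWorkers helper
are made up for illustration. The anonymous subclass adds no state and only
overrides initialValue(); the per-thread int[] it returns is the object that
actually gets written.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadLocalKeyDemo {
    private static final AtomicInteger NEXT_ID = new AtomicInteger();

    // The anonymous subclass adds no fields; it only overrides initialValue().
    // The ThreadLocal object itself is a shared key that is only read.
    static final ThreadLocal<int[]> CONTEXT = new ThreadLocal<int[]>() {
        @Override
        protected int[] initialValue() {
            // One value object per thread: slot 0 is an id, slot 1 a counter.
            return new int[] { NEXT_ID.getAndIncrement(), 0 };
        }
    };

    // Starts `threads` workers; each increments its own counter `iterations`
    // times and reports the final count it observed.
    static int[] runWorkers(int threads, int iterations) {
        int[] results = new int[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int slot = t;
            workers[t] = new Thread(() -> {
                int[] ctx = CONTEXT.get(); // same key, per-thread value
                for (int i = 0; i < iterations; i++) {
                    ctx[1]++;
                }
                results[slot] = ctx[1];
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // join makes the worker's writes visible here
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        for (int count : runWorkers(4, 1000)) {
            System.out.println(count);
        }
    }
}
```

Every worker reports 1000, because no two threads ever touch the same int[].
Whether those value objects end up near each other on the heap is a separate
question, and that is where false sharing between them could come from.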

Regards, Peter


Laurent

  2013/5/9 Peter Levart <peter.lev...@gmail.com>

Hi Aleksey,

Wouldn't it be even better if just threadLocalRandom* fields were
annotated with @Contended("ThreadLocal") ?
Some fields within the Thread object are accessed from non-local threads.
I don't know how frequently, but isolating just threadLocalRandom* fields
from all possible false-sharing scenarios would seem even better, no?
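
A rough sketch of what field-level grouping would look like. The Contended
annotation declared here is only a stand-in with the same shape as the
JDK-internal one, and it has no effect on layout; ThreadModel is a
hypothetical cut-down model, not the real Thread class. The sketch only shows
which fields would land in the "ThreadLocal" contention group.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

public class ContendedGroupSketch {
    // Stand-in annotation (assumption): same shape as the JDK-internal
    // @Contended, but it does NOT change field layout. Only the real
    // annotation, processed by HotSpot, does that.
    @Retention(RetentionPolicy.RUNTIME)
    @Target({ElementType.FIELD, ElementType.TYPE})
    @interface Contended {
        String value() default "";
    }

    // Hypothetical cut-down model of Thread, for illustration only.
    static class ThreadModel {
        long tid;
        int threadStatus;
        // Fields sharing a group name are padded away from everything else,
        // but not from each other.
        @Contended("ThreadLocal") long threadLocalRandomSeed;
        @Contended("ThreadLocal") int threadLocalRandomProbe;
        @Contended("ThreadLocal") int threadLocalRandomSecondarySeed;
    }

    // Counts the declared fields of `type` annotated with the given group.
    static int fieldsInGroup(Class<?> type, String group) {
        int count = 0;
        for (Field f : type.getDeclaredFields()) {
            Contended c = f.getAnnotation(Contended.class);
            if (c != null && c.value().equals(group)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(fieldsInGroup(ThreadModel.class, "ThreadLocal"));
    }
}
```

The difference from a class-level annotation is that the remaining Thread
fields, some of which are accessed from other threads, would stay outside
the padded group.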

Regards, Peter


On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:

Hi,

This is from our backlog after JDK-8005926. After the ThreadLocalRandom
state was merged into Thread, we now have to deal with the false sharing
induced by heavily-updated fields in Thread. TLR was padded before, so
it makes sense to have Thread bear the @Contended annotation to
isolate its fields in the same manner.

The webrev is here:
     http://cr.openjdk.java.net/~shade/8014233/webrev.00/

Testing:
   - microbenchmarks (see below)
   - JPRT cycle against jdk8-tl

The extended rationale for the change follows.

If we look at the current Thread layout, we can see the TLR state is
buried within the Thread instance. The TLR state fields are now by far
the most frequently updated fields in Thread:

Running 64-bit HotSpot VM.
Using compressed references with 3-bit shift.
Objects are 8 bytes aligned.

java.lang.Thread
    offset  size                     type description
         0    12                          (assumed to be the object header + first field alignment)
        12     4                      int Thread.priority
        16     8                     long Thread.eetop
        24     8                     long Thread.stackSize
        32     8                     long Thread.nativeParkEventPointer
        40     8                     long Thread.tid
        48     8                     long Thread.threadLocalRandomSeed
        56     4                      int Thread.threadStatus
        60     4                      int Thread.threadLocalRandomProbe
        64     4                      int Thread.threadLocalRandomSecondarySeed
        68     1                  boolean Thread.single_step
        69     1                  boolean Thread.daemon
        70     1                  boolean Thread.stillborn
        71     1                          (alignment/padding gap)
        72     4                   char[] Thread.name
        76     4                   Thread Thread.threadQ
        80     4                 Runnable Thread.target
        84     4              ThreadGroup Thread.group
        88     4              ClassLoader Thread.contextClassLoader
        92     4     AccessControlContext Thread.inheritedAccessControlContext
        96     4           ThreadLocalMap Thread.threadLocals
       100     4           ThreadLocalMap Thread.inheritableThreadLocals
       104     4                   Object Thread.parkBlocker
       108     4            Interruptible Thread.blocker
       112     4                   Object Thread.blockerLock
       116     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
       120                                (object boundary, size estimate)
VM reports 120 bytes per instance


Assuming current x86 hardware with 64-byte cache line sizes and current
class layout, we can see the trailing fields in Thread are providing
enough insulation from the false sharing with an adjacent object. Also,
the Thread itself is large enough so that two TLRs belonging to
different threads will not collide.

However, the leading fields are not enough: we have a few words which can
occupy the same cache line, but belong to another object. This is where
things can get worse in two ways: a) the TLR update can make the field
access in an adjacent object considerably slower; and, much worse, b) the
update of the adjacent field can disturb the TLR state, which is
critical for j.u.concurrent performance relying heavily on fast TLR.
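
The leading-fields point can be checked with a little arithmetic. This is a
sketch under stated assumptions: 64-byte cache lines, 8-byte object
alignment, and the offset of threadLocalRandomSeed (48) taken from the layout
dump above. The class and method names are made up for illustration.

```java
public class CacheLineMap {
    static final long CACHE_LINE = 64; // bytes, as assumed in the text for x86

    // Index of the cache line that contains the byte at `address`.
    static long lineOf(long address) {
        return Math.floorDiv(address, CACHE_LINE);
    }

    // True if the field starting at `fieldOffset` inside an object that
    // starts at `base` lies in the same cache line as the last byte of
    // whatever precedes the object in memory.
    static boolean sharesLineWithPredecessor(long base, long fieldOffset) {
        return lineOf(base + fieldOffset) == lineOf(base - 1);
    }

    public static void main(String[] args) {
        long seedOffset = 48; // Thread.threadLocalRandomSeed in the dump above
        // Objects are 8-byte aligned, so try every possible start position
        // of the Thread object within a cache line.
        for (long base = 0; base < CACHE_LINE; base += 8) {
            System.out.println("base % 64 = " + base + " -> shares line: "
                    + sharesLineWithPredecessor(base, seedOffset));
        }
    }
}
```

Since only 48 bytes precede the seed, there are start positions where the
seed and the tail of the preceding object fall into the same line; with a
full 64 bytes (or more) of padding in front, that could not happen.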

To illustrate both points, there is a simple benchmark driven by JMH
(http://openjdk.java.net/projects/code-tools/jmh/):
    http://cr.openjdk.java.net/~shade/8014233/threadbench.zip

On my 2x2 i5-2520M Linux x86_64 laptop, running the latest jdk8-tl with and
without @Contended on Thread, that microbenchmark yields the following
results [20x1 sec warmup, 20x1 sec measurements, 10 forks]:

Accessing ThreadLocalRandom.current().nextInt():
    baseline:    932 +-  4 ops/usec
    @Contended:  927 +- 10 ops/usec

Accessing TLR.current().nextInt() *and* Thread.getUncaughtExceptionHandler():
    baseline:    454 +-  2 ops/usec
    @Contended:  490 +-  3 ops/usec

One might note that $uncaughtExceptionHandler is the trailing field in
Thread, so it can naturally be false-shared with the adjacent
thread's TLR. We chose it as the illustration; in real applications
with a multitude of objects on the heap, we can get another contender.

So that is a ~10% performance hit from false sharing, even on a very small
machine. Translating it back: having a heavily-updated field in an object
adjacent to Thread can bring these overheads to TLR, and then jeopardize
j.u.c performance.

Of course, as soon as the status quo of the field layout changes, we might
start to lose spectacularly. I would recommend we deal with this now, so
fewer surprises come in the future.

The caveat is that we are wasting some space per Thread instance.
After the patch, the layout is:

java.lang.Thread
   offset  size                     type description
        0    12                          (assumed to be the object header + first field alignment)
       12   128                          (alignment/padding gap)
      140     4                      int Thread.priority
      144     8                     long Thread.eetop
      152     8                     long Thread.stackSize
      160     8                     long Thread.nativeParkEventPointer
      168     8                     long Thread.tid
      176     8                     long Thread.threadLocalRandomSeed
      184     4                      int Thread.threadStatus
      188     4                      int Thread.threadLocalRandomProbe
      192     4                      int Thread.threadLocalRandomSecondarySeed
      196     1                  boolean Thread.single_step
      197     1                  boolean Thread.daemon
      198     1                  boolean Thread.stillborn
      199     1                          (alignment/padding gap)
      200     4                   char[] Thread.name
      204     4                   Thread Thread.threadQ
      208     4                 Runnable Thread.target
      212     4              ThreadGroup Thread.group
      216     4              ClassLoader Thread.contextClassLoader
      220     4     AccessControlContext Thread.inheritedAccessControlContext
      224     4           ThreadLocalMap Thread.threadLocals
      228     4           ThreadLocalMap Thread.inheritableThreadLocals
      232     4                   Object Thread.parkBlocker
      236     4            Interruptible Thread.blocker
      240     4                   Object Thread.blockerLock
      244     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
      248                                (object boundary, size estimate)
VM reports 376 bytes per instance

...and we have an additional 256 bytes per Thread (twice the
-XX:ContendedPaddingWidth, actually). That seems irrelevant compared to the
space wasted in native memory for each thread, especially stack areas.
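
For a sense of scale, a back-of-the-envelope sketch. The padding width of 128
is the value implied by the two layout dumps (376 - 120 = 256 = 2 * 128); the
1 MiB stack size and the thread count are assumptions picked purely for
illustration, not measurements.

```java
public class ContendedOverhead {
    // Padding width implied by the layout dumps: 376 - 120 = 256 = 2 * 128.
    static final long PADDING_WIDTH = 128;

    // Assumption for illustration only: 1 MiB of stack reserved per thread.
    static final long ASSUMED_STACK_BYTES = 1024L * 1024L;

    // Class-level @Contended pads both before and after the fields.
    static long extraBytesPerThread() {
        return 2 * PADDING_WIDTH;
    }

    // Total extra heap bytes for the given number of Thread instances.
    static long extraHeapBytes(long threads) {
        return threads * extraBytesPerThread();
    }

    // Total stack bytes reserved under the assumed per-thread stack size.
    static long assumedStackBytes(long threads) {
        return threads * ASSUMED_STACK_BYTES;
    }

    public static void main(String[] args) {
        long threads = 10_000; // arbitrary example count
        System.out.println("extra heap bytes:     " + extraHeapBytes(threads));
        System.out.println("assumed stack bytes:  " + assumedStackBytes(threads));
    }
}
```

Under these assumptions the padding amounts to 256 bytes against 1,048,576
bytes of stack per thread, i.e. a factor of 4096 smaller.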

Thanks,
Aleksey.







