On 05/09/2013 04:59 PM, Laurent Bourgès wrote:
Hi all,
A stupid question:
should any ThreadLocal subclass be marked @Contended to be sure that
false sharing never happens between a ThreadLocal instance and any other
object on the heap?
Hi Laurent,
A ThreadLocal object is just a key (into a ThreadLocalMap). It's usually
subclassed not to add any state but to override the initialValue method.
ThreadLocal contains a single final field 'threadLocalHashCode', which
is read at every call to ThreadLocal.get() (usually by multiple
threads). This can contend with a frequent write of a field in some
other object placed in its proximity, yes, but I don't think we
should put @Contended on every class that has frequently read fields.
@Contended should be reserved for classes with fields that are
frequently written, if I understand the concept correctly.
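
To make the "just a key" point concrete, here is a minimal sketch of the
usual kind of subclass. The names ThreadLocalKeySketch, BufferLocal and
BUFFER are made up for illustration; the subclass declares no fields of
its own and only overrides initialValue(), so nothing in it is ever
written after construction.

```java
public class ThreadLocalKeySketch {
    // Hypothetical stateless subclass: no added fields, only initialValue().
    static final class BufferLocal extends ThreadLocal<StringBuilder> {
        @Override
        protected StringBuilder initialValue() {
            return new StringBuilder();
        }
    }

    // One shared key object; every thread reads it, nobody writes to it.
    static final BufferLocal BUFFER = new BufferLocal();

    public static void main(String[] args) {
        // The per-thread value lives in the calling thread's map,
        // the ThreadLocal instance itself is only used to look it up.
        StringBuilder mine = BUFFER.get();
        mine.append("main");
        System.out.println(BUFFER.get());
    }
}
```

The mutable state here is the StringBuilder, which lives in the calling
thread's own map and not in the shared key.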
Regards, Peter
Laurent
2013/5/9 Peter Levart <peter.lev...@gmail.com>
Hi Aleksey,
Wouldn't it be even better if just threadLocalRandom* fields were
annotated with @Contended("ThreadLocal") ?
Some fields within the Thread object are accessed from non-local
threads. I don't know how frequently, but isolating just
threadLocalRandom* fields from all possible false-sharing
scenarios would seem even better, no?
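
A rough sketch of what such field-level grouping would look like. Note
the assumptions: the Contended annotation declared below is a local
stand-in that only mirrors the shape of the JDK-internal one
(sun.misc.Contended in JDK 8) and has no effect on field layout, and
ThreadModel is a hypothetical cut-down model, not the real Thread class.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

public class ContendedGroupSketch {
    // Stand-in annotation for illustration only. It does NOT change layout;
    // the real annotation is JDK-internal and honored only by the VM.
    @Retention(RetentionPolicy.RUNTIME)
    @Target({ElementType.FIELD, ElementType.TYPE})
    @interface Contended {
        String value() default "";
    }

    // Hypothetical model: only the TLR fields share one contention group.
    static class ThreadModel {
        long tid; // stays in the ordinary field area
        @Contended("ThreadLocal") long threadLocalRandomSeed;
        @Contended("ThreadLocal") int threadLocalRandomProbe;
        @Contended("ThreadLocal") int threadLocalRandomSecondarySeed;
    }

    // Counts the fields of ThreadModel that belong to the given group.
    static long countGrouped(String group) {
        long count = 0;
        for (Field f : ThreadModel.class.getDeclaredFields()) {
            Contended c = f.getAnnotation(Contended.class);
            if (c != null && c.value().equals(group)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("fields in group: " + countGrouped("ThreadLocal"));
    }
}
```

With the real annotation, fields in the same named group are kept
together and padded away from the remaining fields, so the rest of the
object would not pay for isolation it does not need.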
Regards, Peter
On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:
Hi,
This is from our backlog after JDK-8005926. After ThreadLocalRandom
state was merged into Thread, we now have to deal with the false sharing
induced by heavily-updated fields in Thread. TLR was padded before, and
it should make sense to make Thread bear @Contended annotation to
isolate its fields in the same manner.
The webrev is here:
http://cr.openjdk.java.net/~shade/8014233/webrev.00/
Testing:
- microbenchmarks (see below)
- JPRT cycle against jdk8-tl
The extended rationale for the change follows.
If we look at the current Thread layout, we can see the TLR state is
buried within the Thread instance. The TLR state fields are by far the
most frequently updated fields in Thread now:
Running 64-bit HotSpot VM.
Using compressed references with 3-bit shift.
Objects are 8 bytes aligned.
java.lang.Thread
 offset  size  type                      description
      0    12                            (assumed to be the object header + first field alignment)
     12     4  int                       Thread.priority
     16     8  long                      Thread.eetop
     24     8  long                      Thread.stackSize
     32     8  long                      Thread.nativeParkEventPointer
     40     8  long                      Thread.tid
     48     8  long                      Thread.threadLocalRandomSeed
     56     4  int                       Thread.threadStatus
     60     4  int                       Thread.threadLocalRandomProbe
     64     4  int                       Thread.threadLocalRandomSecondarySeed
     68     1  boolean                   Thread.single_step
     69     1  boolean                   Thread.daemon
     70     1  boolean                   Thread.stillborn
     71     1                            (alignment/padding gap)
     72     4  char[]                    Thread.name
     76     4  Thread                    Thread.threadQ
     80     4  Runnable                  Thread.target
     84     4  ThreadGroup               Thread.group
     88     4  ClassLoader               Thread.contextClassLoader
     92     4  AccessControlContext      Thread.inheritedAccessControlContext
     96     4  ThreadLocalMap            Thread.threadLocals
    100     4  ThreadLocalMap            Thread.inheritableThreadLocals
    104     4  Object                    Thread.parkBlocker
    108     4  Interruptible             Thread.blocker
    112     4  Object                    Thread.blockerLock
    116     4  UncaughtExceptionHandler  Thread.uncaughtExceptionHandler
    120                                  (object boundary, size estimate)
VM reports 120 bytes per instance
Assuming current x86 hardware with 64-byte cache line sizes and current
class layout, we can see the trailing fields in Thread are providing
enough insulation from the false sharing with an adjacent object. Also,
the Thread itself is large enough so that two TLRs belonging to
different threads will not collide.
However the leading fields are not enough: we have a few words which can
occupy the same cache line, but belong to another object. This is where
things can get worse in two ways: a) the TLR update can make the field
access in adjacent object considerably slower; and much worse b) the
update in the adjacent field can disturb the TLR state, which is
critical for j.u.concurrent performance relying heavily on fast TLR.
To illustrate both points, there is a simple benchmark driven by JMH
(http://openjdk.java.net/projects/code-tools/jmh/):
http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
On my 2x2 i5-2520M Linux x86_64 laptop, running the latest jdk8-tl with
and without @Contended on Thread, that microbenchmark yields the
following results [20x1 sec warmup, 20x1 sec measurements, 10 forks]:
Accessing ThreadLocalRandom.current().nextInt():
baseline: 932 +- 4 ops/usec
@Contended: 927 +- 10 ops/usec
Accessing TLR.current().nextInt() *and* Thread.getUEHandler():
baseline: 454 +- 2 ops/usec
@Contended: 490 +- 3 ops/usec
One might note that $uncaughtExceptionHandler is the trailing field in
Thread, so it can naturally be false-shared with the adjacent thread's
TLR. We chose this as the illustration; in real workloads, with a
multitude of objects on the heap, some other contender can take its
place.
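
As a back-of-the-envelope check of that adjacency, here is a small
sketch. It assumes 64-byte cache lines, the pre-patch field offsets from
the layout dump, and two hypothetical Thread instances allocated back to
back at a made-up base address; real heap placement is up to the
allocator and GC.

```java
public class CacheLineSketch {
    static final int LINE = 64; // cache line size assumed in this message

    static long lineOf(long address) {
        return address / LINE;
    }

    // True if byte ranges [a, a+sizeA) and [b, b+sizeB) touch a common line.
    static boolean sharesLine(long a, int sizeA, long b, int sizeB) {
        long aFirst = lineOf(a), aLast = lineOf(a + sizeA - 1);
        long bFirst = lineOf(b), bLast = lineOf(b + sizeB - 1);
        return aFirst <= bLast && bFirst <= aLast;
    }

    // Thread A starts at 'base'; Thread B follows immediately (120-byte instances).
    static boolean handlerSharesWithNextSeed(long base) {
        long handlerA = base + 116;   // A.uncaughtExceptionHandler, 4 bytes
        long seedB = base + 120 + 48; // B.threadLocalRandomSeed, 8 bytes
        return sharesLine(handlerA, 4, seedB, 8);
    }

    public static void main(String[] args) {
        // Objects are 8-byte aligned, so try every alignment within one line.
        for (long base = 0; base < LINE; base += 8) {
            System.out.println("base " + base + ": shared line = "
                    + handlerSharesWithNextSeed(base));
        }
    }
}
```

Whether the two fields actually land on one line depends on where the
first object starts, which is one reason the effect can come and go
between runs.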
So that is a ~10% performance hit from false sharing even on a very
small machine. Translating it back: having a heavily-updated field in
the object adjacent to Thread can bring these overheads to TLR, and then
jeopardize j.u.c performance.
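
For readers who want to see the general effect without JMH, here is a
plain-Java sketch. It is not the benchmark from threadbench.zip, and the
class name, slot indices and iteration count are arbitrary choices. Two
threads each increment their own slot of an AtomicLongArray, first with
slots 8 bytes apart (usually on one cache line), then 256 bytes apart.

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class FalseSharingSketch {
    // Two threads each increment their own slot 'iterations' times.
    // Returns the elapsed wall-clock time in nanoseconds.
    static long run(AtomicLongArray slots, int indexA, int indexB, long iterations) {
        Thread a = new Thread(() -> {
            for (long i = 0; i < iterations; i++) slots.incrementAndGet(indexA);
        });
        Thread b = new Thread(() -> {
            for (long i = 0; i < iterations; i++) slots.incrementAndGet(indexB);
        });
        long start = System.nanoTime();
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long iterations = 5_000_000L;
        AtomicLongArray slots = new AtomicLongArray(64);
        long adjacent = run(slots, 16, 17, iterations); // neighbouring longs
        long spaced = run(slots, 16, 48, iterations);   // 32 longs = 256 bytes apart
        System.out.println("adjacent slots: " + adjacent / 1_000_000 + " ms");
        System.out.println("spaced slots:   " + spaced / 1_000_000 + " ms");
    }
}
```

This is a single unwarmed timing, so treat the numbers as a rough
indication only; a harness like JMH is the right tool for real
measurements.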
Of course, as soon as the status quo about field layout changes, we
might start to lose spectacularly. I would recommend we deal with this
now, so fewer surprises come in the future.
The caveat is that we are wasting some space per Thread instance.
After the patch, the layout is:
java.lang.Thread
 offset  size  type                      description
      0    12                            (assumed to be the object header + first field alignment)
     12   128                            (alignment/padding gap)
    140     4  int                       Thread.priority
    144     8  long                      Thread.eetop
    152     8  long                      Thread.stackSize
    160     8  long                      Thread.nativeParkEventPointer
    168     8  long                      Thread.tid
    176     8  long                      Thread.threadLocalRandomSeed
    184     4  int                       Thread.threadStatus
    188     4  int                       Thread.threadLocalRandomProbe
    192     4  int                       Thread.threadLocalRandomSecondarySeed
    196     1  boolean                   Thread.single_step
    197     1  boolean                   Thread.daemon
    198     1  boolean                   Thread.stillborn
    199     1                            (alignment/padding gap)
    200     4  char[]                    Thread.name
    204     4  Thread                    Thread.threadQ
    208     4  Runnable                  Thread.target
    212     4  ThreadGroup               Thread.group
    216     4  ClassLoader               Thread.contextClassLoader
    220     4  AccessControlContext      Thread.inheritedAccessControlContext
    224     4  ThreadLocalMap            Thread.threadLocals
    228     4  ThreadLocalMap            Thread.inheritableThreadLocals
    232     4  Object                    Thread.parkBlocker
    236     4  Interruptible             Thread.blocker
    240     4  Object                    Thread.blockerLock
    244     4  UncaughtExceptionHandler  Thread.uncaughtExceptionHandler
    248                                  (object boundary, size estimate)
VM reports 376 bytes per instance
...and we have an additional 256 bytes per Thread (twice the
-XX:ContendedPaddingWidth, actually). This seems irrelevant compared to
the space wasted in native memory for each thread, especially stack
areas.
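
The size arithmetic can be sketched as follows. The model is an
assumption: class-level @Contended is taken to add one padding region
before the fields and one after them, each ContendedPaddingWidth bytes
wide, and the width of 128 is inferred from the 256 extra bytes reported
above. The class and method names are made up for illustration.

```java
public class ContendedPaddingSketch {
    // Assumed model: one padding region on each side of the fields.
    static int paddedInstanceSize(int baseSize, int paddingWidth) {
        return baseSize + 2 * paddingWidth;
    }

    public static void main(String[] args) {
        int baseSize = 120;     // bytes per Thread instance before the patch
        int paddingWidth = 128; // assumed -XX:ContendedPaddingWidth value
        int padded = paddedInstanceSize(baseSize, paddingWidth);
        System.out.println("padded size: " + padded + " bytes");
        System.out.println("overhead for 1000 threads: "
                + 1000 * (padded - baseSize) / 1024 + " KiB");
    }
}
```

Under this model the result matches the 376 bytes per instance that the
VM reports after the patch.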
Thanks,
Aleksey.