Hi all,

A stupid question: should any ThreadLocal subclass be marked @Contended, to be sure that false sharing never happens between a ThreadLocal instance and any other object on the heap?
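For concreteness, this is the class-level form I am asking about, shown next to the field-level grouping Peter suggests below. It is only a made-up sketch using the JDK 8 sun.misc.Contended annotation (class names are invented), and as far as I know user code also needs -XX:-RestrictContended for the annotation to be honored:

    import sun.misc.Contended;

    // Class-level form: pads this class's whole field block away from
    // neighboring objects on the heap. Note that a ThreadLocal subclass
    // usually has no hot per-instance state anyway; the per-thread values
    // live in each Thread's ThreadLocalMap.
    @Contended
    class PaddedThreadLocal extends ThreadLocal<Long> {
        @Override
        protected Long initialValue() {
            return 0L;
        }
    }

    // Field-level form with a group name, in the spirit of the
    // @Contended("ThreadLocal") suggestion: fields in the same group are
    // isolated from everything else, but not from each other.
    class ContendedGroupExample {
        @Contended("ThreadLocal")
        long seed;

        @Contended("ThreadLocal")
        int probe;

        int unrelated; // laid out outside the padded group
    }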
Laurent

2013/5/9 Peter Levart <peter.lev...@gmail.com>

> Hi Aleksey,
>
> Wouldn't it be even better if just the threadLocalRandom* fields were
> annotated with @Contended("ThreadLocal")?
> Some fields within the Thread object are accessed from non-local threads.
> I don't know how frequently, but isolating just the threadLocalRandom*
> fields from all possible false-sharing scenarios would seem even better, no?
>
> Regards, Peter
>
>
> On 05/08/2013 07:29 PM, Aleksey Shipilev wrote:
>> Hi,
>>
>> This is from our backlog after JDK-8005926. After the ThreadLocalRandom
>> state was merged into Thread, we now have to deal with the false sharing
>> induced by heavily updated fields in Thread. TLR was padded before, and
>> it makes sense to have Thread bear the @Contended annotation to isolate
>> its fields in the same manner.
>>
>> The webrev is here:
>>   http://cr.openjdk.java.net/~shade/8014233/webrev.00/
>>
>> Testing:
>>   - microbenchmarks (see below)
>>   - JPRT cycle against jdk8-tl
>>
>> The extended rationale for the change follows.
>>
>> If we look at the current Thread layout, we can see the TLR state is
>> buried within the Thread instance. The TLR state fields are by far the
>> most heavily updated fields in Thread now:
>>
>>> Running 64-bit HotSpot VM.
>>> Using compressed references with 3-bit shift.
>>> Objects are 8 bytes aligned.
>>>
>>> java.lang.Thread
>>>  offset  size  type                      description
>>>       0    12                            (assumed to be the object header + first field alignment)
>>>      12     4  int                       Thread.priority
>>>      16     8  long                      Thread.eetop
>>>      24     8  long                      Thread.stackSize
>>>      32     8  long                      Thread.nativeParkEventPointer
>>>      40     8  long                      Thread.tid
>>>      48     8  long                      Thread.threadLocalRandomSeed
>>>      56     4  int                       Thread.threadStatus
>>>      60     4  int                       Thread.threadLocalRandomProbe
>>>      64     4  int                       Thread.threadLocalRandomSecondarySeed
>>>      68     1  boolean                   Thread.single_step
>>>      69     1  boolean                   Thread.daemon
>>>      70     1  boolean                   Thread.stillborn
>>>      71     1                            (alignment/padding gap)
>>>      72     4  char[]                    Thread.name
>>>      76     4  Thread                    Thread.threadQ
>>>      80     4  Runnable                  Thread.target
>>>      84     4  ThreadGroup               Thread.group
>>>      88     4  ClassLoader               Thread.contextClassLoader
>>>      92     4  AccessControlContext      Thread.inheritedAccessControlContext
>>>      96     4  ThreadLocalMap            Thread.threadLocals
>>>     100     4  ThreadLocalMap            Thread.inheritableThreadLocals
>>>     104     4  Object                    Thread.parkBlocker
>>>     108     4  Interruptible             Thread.blocker
>>>     112     4  Object                    Thread.blockerLock
>>>     116     4  UncaughtExceptionHandler  Thread.uncaughtExceptionHandler
>>>     120                                  (object boundary, size estimate)
>>> VM reports 120 bytes per instance
>>
>> Assuming current x86 hardware with 64-byte cache lines and the current
>> class layout, we can see the trailing fields in Thread provide enough
>> insulation from false sharing with an adjacent object. Also, Thread
>> itself is large enough that the TLR states belonging to two different
>> threads will not collide.
>>
>> However, the leading fields are not enough: a few words can occupy the
>> same cache line while belonging to another object. This is where things
>> can get worse in two ways: a) the TLR update can make field accesses in
>> the adjacent object considerably slower; and, much worse, b) an update
>> to the adjacent field can disturb the TLR state, which is critical for
>> j.u.concurrent performance that relies heavily on a fast TLR.
>>
>> To illustrate both points, there is a simple benchmark driven by JMH
>> (http://openjdk.java.net/projects/code-tools/jmh/):
>>
>>   http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
>>
>> On my 2x2 i5-2520M Linux x86_64 laptop, running the latest jdk8-tl and
>> Thread with/without @Contended, that microbenchmark yields the following
>> results [20x1 sec warmup, 20x1 sec measurements, 10 forks]:
>>
>> Accessing ThreadLocalRandom.current().nextInt():
>>     baseline:   932 +- 4  ops/usec
>>   @Contended:   927 +- 10 ops/usec
>>
>> Accessing ThreadLocalRandom.current().nextInt() *and*
>> Thread.getUncaughtExceptionHandler():
>>     baseline:   454 +- 2  ops/usec
>>   @Contended:   490 +- 3  ops/usec
>>
>> One might note that uncaughtExceptionHandler is the trailing field in
>> Thread, so it can naturally be false-shared with an adjacent thread's
>> TLR state. We chose it for the illustration; in real workloads, with a
>> multitude of objects on the heap, some other neighbor can become the
>> contender.
>>
>> So that is a ~10% performance hit from false sharing, even on a very
>> small machine. Translating it back: a heavily updated field in an object
>> adjacent to a Thread can bring the same overhead to TLR, and thereby
>> jeopardize j.u.c performance.
>>
>> Of course, as soon as the status quo in the field layout changes, we
>> might start to lose spectacularly. I would recommend we deal with this
>> now, so there are fewer surprises in the future.
>>
>> The caveat is that we waste some space per Thread instance. After the
>> patch, the layout is:
>>
>>> java.lang.Thread
>>>  offset  size  type                      description
>>>       0    12                            (assumed to be the object header + first field alignment)
>>>      12   128                            (alignment/padding gap)
>>>     140     4  int                       Thread.priority
>>>     144     8  long                      Thread.eetop
>>>     152     8  long                      Thread.stackSize
>>>     160     8  long                      Thread.nativeParkEventPointer
>>>     168     8  long                      Thread.tid
>>>     176     8  long                      Thread.threadLocalRandomSeed
>>>     184     4  int                       Thread.threadStatus
>>>     188     4  int                       Thread.threadLocalRandomProbe
>>>     192     4  int                       Thread.threadLocalRandomSecondarySeed
>>>     196     1  boolean                   Thread.single_step
>>>     197     1  boolean                   Thread.daemon
>>>     198     1  boolean                   Thread.stillborn
>>>     199     1                            (alignment/padding gap)
>>>     200     4  char[]                    Thread.name
>>>     204     4  Thread                    Thread.threadQ
>>>     208     4  Runnable                  Thread.target
>>>     212     4  ThreadGroup               Thread.group
>>>     216     4  ClassLoader               Thread.contextClassLoader
>>>     220     4  AccessControlContext      Thread.inheritedAccessControlContext
>>>     224     4  ThreadLocalMap            Thread.threadLocals
>>>     228     4  ThreadLocalMap            Thread.inheritableThreadLocals
>>>     232     4  Object                    Thread.parkBlocker
>>>     236     4  Interruptible             Thread.blocker
>>>     240     4  Object                    Thread.blockerLock
>>>     244     4  UncaughtExceptionHandler  Thread.uncaughtExceptionHandler
>>>     248                                  (object boundary, size estimate)
>>> VM reports 376 bytes per instance
>>
>> ...and we have an additional 256 bytes per Thread (twice the
>> -XX:ContendedPaddingWidth, actually). That seems irrelevant compared to
>> the space wasted in native memory for each thread, especially the stack
>> areas.
>>
>> Thanks,
>> Aleksey.
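For readers without threadbench.zip handy, the access pattern Aleksey describes above is roughly the sketch below. This is only my approximation (class and method names are invented, the thread count is guessed, and it uses the current JMH annotation names); the actual harness in the webrev may differ:

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Fork;
    import org.openjdk.jmh.annotations.Measurement;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Threads;
    import org.openjdk.jmh.annotations.Warmup;
    import org.openjdk.jmh.infra.Blackhole;

    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)   // ops/usec, as in the results above
    @Warmup(iterations = 20, time = 1)       // 20 x 1 sec warmup
    @Measurement(iterations = 20, time = 1)  // 20 x 1 sec measurement
    @Fork(10)                                // 10 forks
    @Threads(4)                              // guessed for the 2x2 laptop
    public class ThreadBench {

        // TLR state only: largely insensitive to @Contended on Thread.
        @Benchmark
        public int tlrOnly() {
            return ThreadLocalRandom.current().nextInt();
        }

        // TLR update plus a read of the trailing Thread field; this is the
        // case where false sharing with an adjacent object shows up.
        @Benchmark
        public void tlrAndHandler(Blackhole bh) {
            bh.consume(ThreadLocalRandom.current().nextInt());
            bh.consume(Thread.currentThread().getUncaughtExceptionHandler());
        }
    }

Running this against a baseline build and a build with the patched Thread should reproduce the comparison above; I believe -XX:-EnableContended can also be used on a patched build to switch the padding off for a quick A/B run.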