Hi David,

It depends on the scenario we are assessing. For the sake of argument, let's say every thread has requested TLR.current() at least once.
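For reference, TLR.current() above is java.util.concurrent.ThreadLocalRandom.current(), which hands back the calling thread's generator; a minimal usage example (the class name is mine, not from the patch):

```java
import java.util.concurrent.ThreadLocalRandom;

// Minimal example of the access pattern under discussion: current() returns
// the per-thread generator; after the 8014233 merge, its seed/probe state
// lives directly in java.lang.Thread fields (see the layout dumps below).
public class TlrExample {
    public static void main(String[] args) {
        int n = ThreadLocalRandom.current().nextInt(100); // uniform in [0, 100)
        System.out.println(n >= 0 && n < 100); // always true
    }
}
```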
Before the merge:
  Thread maps for ThreadLocal  =~ 32 bytes x #threads
  TLR instances + padding      =~ (128 + 8?) bytes x #threads

After the merge:
  TLR fields in Thread + padding =~ (2x128 + 16) bytes x #threads

So there is an additional footprint cost per Thread, but it seems negligible compared to what the native thread already allocates for its native structures (e.g. the stack). Note that @Contended uses larger padding to anticipate hardware prefetchers being turned on as well (the VM can get smarter about this, though).

Gory details:

**** -XX:-EnableContended: ****

Running 64-bit HotSpot VM.
Using compressed references with 3-bit shift.
Objects are 8 bytes aligned.

java.lang.Thread
 offset  size  type                      description
      0    12                            (assumed to be the object header + first field alignment)
     12     4  int                       Thread.priority
     16     8  long                      Thread.eetop
     24     8  long                      Thread.stackSize
     32     8  long                      Thread.nativeParkEventPointer
     40     8  long                      Thread.tid
     48     8  long                      Thread.threadLocalRandomSeed
     56     4  int                       Thread.threadStatus
     60     4  int                       Thread.threadLocalRandomProbe
     64     4  int                       Thread.threadLocalRandomSecondarySeed
     68     1  boolean                   Thread.single_step
     69     1  boolean                   Thread.daemon
     70     1  boolean                   Thread.stillborn
     71     1                            (alignment/padding gap)
     72     4  char[]                    Thread.name
     76     4  Thread                    Thread.threadQ
     80     4  Runnable                  Thread.target
     84     4  ThreadGroup               Thread.group
     88     4  ClassLoader               Thread.contextClassLoader
     92     4  AccessControlContext      Thread.inheritedAccessControlContext
     96     4  ThreadLocalMap            Thread.threadLocals
    100     4  ThreadLocalMap            Thread.inheritableThreadLocals
    104     4  Object                    Thread.parkBlocker
    108     4  Interruptible             Thread.blocker
    112     4  Object                    Thread.blockerLock
    116     4  UncaughtExceptionHandler  Thread.uncaughtExceptionHandler
    120                                  (object boundary, size estimate)

VM reports 120 bytes per instance

**** -XX:+EnableContended: ****

Running 64-bit HotSpot VM.
Using compressed references with 3-bit shift.
Objects are 8 bytes aligned.
java.lang.Thread
 offset  size  type                      description
      0    12                            (assumed to be the object header + first field alignment)
     12     4  int                       Thread.priority
     16     8  long                      Thread.eetop
     24     8  long                      Thread.stackSize
     32     8  long                      Thread.nativeParkEventPointer
     40     8  long                      Thread.tid
     48     4  int                       Thread.threadStatus
     52     1  boolean                   Thread.single_step
     53     1  boolean                   Thread.daemon
     54     1  boolean                   Thread.stillborn
     55     1                            (alignment/padding gap)
     56     4  char[]                    Thread.name
     60     4  Thread                    Thread.threadQ
     64     4  Runnable                  Thread.target
     68     4  ThreadGroup               Thread.group
     72     4  ClassLoader               Thread.contextClassLoader
     76     4  AccessControlContext      Thread.inheritedAccessControlContext
     80     4  ThreadLocalMap            Thread.threadLocals
     84     4  ThreadLocalMap            Thread.inheritableThreadLocals
     88     4  Object                    Thread.parkBlocker
     92     4  Interruptible             Thread.blocker
     96     4  Object                    Thread.blockerLock
    100     4  UncaughtExceptionHandler  Thread.uncaughtExceptionHandler
    104   128                            (alignment/padding gap)
    232     8  long                      Thread.threadLocalRandomSeed
    240     4  int                       Thread.threadLocalRandomProbe
    244     4  int                       Thread.threadLocalRandomSecondarySeed
    248                                  (object boundary, size estimate)

VM reports 376 bytes per instance

-Aleksey.

On 06/18/2013 06:03 AM, David Holmes wrote:
> Hi Aleksey,
>
> What is the overall change in memory use for this set of changes ie what
> did we use pre TLR merging and what do we use now?
>
> Thanks,
> David
>
> On 17/06/2013 7:00 PM, Aleksey Shipilev wrote:
>> Hi,
>>
>> This is the respin of the RFE filed a month ago:
>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2013-May/016754.html
>>
>> The webrev is here:
>> http://cr.openjdk.java.net/~shade/8014233/webrev.02/
>>
>> Testing:
>>   - JPRT build passes
>>   - Linux x86_64/release passes jdk/java/lang jtreg
>>   - vm.quick.testlist, vm.quick-gc.testlist on selected platforms
>>   - microbenchmarks, see below
>>
>> The rationale follows.
>>
>> After we merged ThreadLocalRandom state in the thread, we are now
>> missing the padding to prevent false sharing on those heavily-updated
>> fields.
>> While the Thread is already large enough to separate two TLR
>> states for two distinct threads, we can still get false sharing with
>> other Thread fields.
>>
>> There is a benchmark showcasing this:
>> http://cr.openjdk.java.net/~shade/8014233/threadbench.zip
>>
>> There are two test cases: the first calls its own TLR with nextInt()
>> and then reads the current thread's ID; the other test reads *another*
>> thread's ID, thus inducing false sharing against another thread's TLR
>> state.
>>
>> On my 2x2 i5 laptop, running Linux x86_64:
>>   same:  355 +- 1 ops/usec
>>   other: 100 +- 5 ops/usec
>>
>> Note the decrease in throughput because of the false sharing.
>>
>> With the patch:
>>   same:  359 +- 1 ops/usec
>>   other: 356 +- 1 ops/usec
>>
>> Note the performance is back. We want to avoid these spurious decreases
>> in performance, caused by either unlucky memory layout or user code
>> (un)intentionally ruining the cache line locality for the updater thread.
>>
>> Thanks,
>> -Aleksey.
>>
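The false-sharing effect the quoted benchmark measures can be sketched minimally in plain Java (this is a made-up illustration, not the code from threadbench.zip; the class name, iteration count, and strides are all assumptions):

```java
// Sketch of false sharing: two threads each bump their own slot in a shared
// long[]. With stride 1 the slots sit on the same cache line, so each write
// invalidates the other core's line; with stride 16 (16 x 8 = 128 bytes,
// matching the @Contended padding) the slots land on different lines.
public class FalseSharingSketch {
    static final int ITERS = 5_000_000;

    static long[] run(int stride) throws InterruptedException {
        long[] slots = new long[stride + 1];
        Thread a = new Thread(() -> { for (int i = 0; i < ITERS; i++) slots[0]++; });
        Thread b = new Thread(() -> { for (int i = 0; i < ITERS; i++) slots[stride]++; });
        long t0 = System.nanoTime();
        a.start(); b.start(); a.join(); b.join();
        System.out.printf("stride %2d: %d ms%n", stride, (System.nanoTime() - t0) / 1_000_000);
        return slots;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] shared = run(1);   // same cache line: slower on most hardware
        long[] padded = run(16);  // 128 bytes apart: avoids the line ping-pong
        // Each writer owns its slot exclusively, so the counts are exact.
        System.out.println(shared[0] == ITERS && shared[1] == ITERS);   // true
        System.out.println(padded[0] == ITERS && padded[16] == ITERS);  // true
    }
}
```

The timings vary by machine, so only the direction of the difference is meaningful; the shape of the effect is the same as the same/other gap in the numbers above.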