Hi Ben, Here is the webrev online:
http://cr.openjdk.java.net/~psandoz/jdk/buffer-reachability-fence/webrev/index.html (I don’t know if you have any colleagues with author or above roles in OpenJDK to upload for you, it might be faster.) Reference.java — 423 @ForceInline 424 public static void reachabilityFence(Object ref) { 425 // Does nothing, because this method is annotated with @DontInline 426 // HotSpot needs to retain the ref and not GC it before a call to this 427 // method 428 } We need to update the comment, preferably using a summary of Vladimir’s analysis. Direct-X-Buffer-bin.java.template — 34 private $type$ get$Type$(long a) { 35 $memtype$ x = UNSAFE.get$Memtype$Unaligned(null, a, bigEndian); 36 $type$ y = $fromBits$(x); 37 Reference.reachabilityFence(this); 38 return y; 39 } It’s overkill in the above case but for good practice reasons i recommend for all usages using a try/finally block as suggested in the JavaDoc for Reference.reachabilityFence. Direct-X-Buffer.java.template — 260 public $type$ get() { 261 return $fromBits$($swap$(UNSAFE.get$Swaptype$(ix(nextGetIndex())))); 262 } 263 264 public $type$ get(int i) { 265 return $fromBits$($swap$(UNSAFE.get$Swaptype$(ix(checkIndex(i))))); 266 } 267 268 #if[streamableType] 269 $type$ getUnchecked(int i) { 270 return $fromBits$($swap$(UNSAFE.get$Swaptype$(ix(i)))); 271 } 272 #end[streamableType] Missing fences. We also need to look carefully at the bulk operations as well, you have a fence for the bulk put accepting a ByteBuffer although a fence is likely required on the src as well. 506 byte _get(int i) { // package-private 507 return UNSAFE.getByte(address + i); 508 } AFAICT the _get and _put methods are no longer used and the code could be deleted (left over from other refactoring to the view classes). Thanks, Paul. > On Feb 19, 2018, at 8:37 AM, Ben Walsh <ben_wa...@uk.ibm.com> wrote: > > As requested, here are the results with modifications to the annotations > on Reference.reachabilityFence. Much more promising ... > > > * Benchmark 1 * > > Test Code : > > package org.sample; > > import org.openjdk.jmh.annotations.Benchmark; > import org.openjdk.jmh.annotations.Level; > import org.openjdk.jmh.annotations.Scope; > import org.openjdk.jmh.annotations.Setup; > import org.openjdk.jmh.annotations.State; > > import java.nio.ByteBuffer; > > public class ByteBufferBenchmark { > > @State(Scope.Benchmark) > public static class ByteBufferContainer { > > ByteBuffer bb; > > @Setup(Level.Invocation) > public void initByteBuffer() { > bb = ByteBuffer.allocateDirect(1); > } > > ByteBuffer getByteBuffer() { > return bb; > } > } > > @Benchmark > public void benchmark_byte_buffer_put(ByteBufferContainer bbC) { > > bbC.getByteBuffer().put((byte)42); > } > > } > > Results : > > - Unmodified Build - > > Benchmark Mode Cnt Score > Error Units > ByteBufferBenchmark.benchmark_byte_buffer_put thrpt 200 35604933.518 ± > 654975.515 ops/s > > - Build With Reference.reachabilityFences Added - > > Benchmark Mode Cnt Score > Error Units Impact > ByteBufferBenchmark.benchmark_byte_buffer_put thrpt 200 33100911.857 ± > 747461.951 ops/s -7.033% > > - Build With Reference.reachabilityFences Added And DontInline Replaced > With ForceInline - > > Benchmark Mode Cnt Score > Error Units Impact > ByteBufferBenchmark.benchmark_byte_buffer_put thrpt 200 34836320.294 ± > 640188.408 ops/s -2.159% > > - Build With Reference.reachabilityFences Added And DontInline Removed - > > Benchmark Mode Cnt Score > Error Units Impact > ByteBufferBenchmark.benchmark_byte_buffer_put thrpt 200 34740015.332 ± > 556578.542 ops/s -2.429% > > > * Benchmark 2 * > > Test Code : > > package org.sample; > > import org.openjdk.jmh.annotations.Benchmark; > import org.openjdk.jmh.annotations.Level; > import org.openjdk.jmh.annotations.Param; > import org.openjdk.jmh.annotations.Scope; > import org.openjdk.jmh.annotations.Setup; > import org.openjdk.jmh.annotations.State; > > import java.nio.ByteBuffer; > > @State(Scope.Benchmark) > public class ByteBufferBenchmark { > > @Param({"1", "10", "100", "1000", "10000"}) > public int L; > > @State(Scope.Benchmark) > public static class ByteBufferContainer { > > ByteBuffer bb; > > @Setup(Level.Invocation) > public void initByteBuffer() { > bb = ByteBuffer.allocateDirect(10000); > } > > ByteBuffer getByteBuffer() { > return bb; > } > } > > @Benchmark > public ByteBuffer benchmark_byte_buffer_put(ByteBufferContainer bbC) { > > ByteBuffer bb = bbC.getByteBuffer(); > > for (int i = 0; i < L; i++) { > bb.put((byte)i); > } > > return bb; > } > > } > > Results : > > - Unmodified Build - > > Benchmark (L) Mode Cnt Score > Error Units > ByteBufferBenchmark.benchmark_byte_buffer_put 1 thrpt 200 > 29303145.752 ± 635979.750 ops/s > ByteBufferBenchmark.benchmark_byte_buffer_put 10 thrpt 200 > 24260859.017 ± 528891.303 ops/s > ByteBufferBenchmark.benchmark_byte_buffer_put 100 thrpt 200 > 8512366.637 ± 136615.070 ops/s > ByteBufferBenchmark.benchmark_byte_buffer_put 1000 thrpt 200 > 1323756.037 ± 21485.369 ops/s > ByteBufferBenchmark.benchmark_byte_buffer_put 10000 thrpt 200 > 145965.305 ± 1301.469 ops/s > > - Build With Reference.reachabilityFences Added - > > Benchmark (L) Mode Cnt Score > Error Units Impact > ByteBufferBenchmark.benchmark_byte_buffer_put 1 thrpt 200 > 28893540.122 ± 754554.747 ops/s -1.398% > ByteBufferBenchmark.benchmark_byte_buffer_put 10 thrpt 200 > 15317696.355 ± 231621.608 ops/s -36.863% > ByteBufferBenchmark.benchmark_byte_buffer_put 100 thrpt 200 > 2546599.578 ± 32136.873 ops/s -70.084% > ByteBufferBenchmark.benchmark_byte_buffer_put 1000 thrpt 200 > 288832.514 ± 3854.522 ops/s -78.181% > ByteBufferBenchmark.benchmark_byte_buffer_put 10000 thrpt 200 29747.386 > ± 214.831 ops/s -79.620% > > - Build With Reference.reachabilityFences Added And DontInline Replaced > With ForceInline - > > Benchmark (L) Mode Cnt Score > Error Units Impact > ByteBufferBenchmark.benchmark_byte_buffer_put 1 thrpt 200 > 29372326.859 ± 525988.179 ops/s +0.236% > ByteBufferBenchmark.benchmark_byte_buffer_put 10 thrpt 200 > 24326735.480 ± 484358.862 ops/s +0.272% > ByteBufferBenchmark.benchmark_byte_buffer_put 100 thrpt 200 > 8492692.912 ± 120924.878 ops/s -0.231% > ByteBufferBenchmark.benchmark_byte_buffer_put 1000 thrpt 200 > 1332131.417 ± 14981.587 ops/s +0.633% > ByteBufferBenchmark.benchmark_byte_buffer_put 10000 thrpt 200 > 144990.569 ± 1518.877 ops/s -0.668% > > - Build With Reference.reachabilityFences Added And DontInline Removed - > > Benchmark (L) Mode Cnt Score > Error Units Impact > ByteBufferBenchmark.benchmark_byte_buffer_put 1 thrpt 200 > 29842696.017 ± 462902.634 ops/s +1.841% > ByteBufferBenchmark.benchmark_byte_buffer_put 10 thrpt 200 > 24842729.069 ± 436174.452 ops/s +2.398% > ByteBufferBenchmark.benchmark_byte_buffer_put 100 thrpt 200 > 8518393.953 ± 129254.536 ops/s +0.071% > ByteBufferBenchmark.benchmark_byte_buffer_put 1000 thrpt 200 > 1344772.370 ± 15916.867 ops/s +1.588% > ByteBufferBenchmark.benchmark_byte_buffer_put 10000 thrpt 200 > 145087.256 ± 1277.491 ops/s -0.602% > > > * Benchmark 3 * > > Test Code : > > package org.sample; > > import org.openjdk.jmh.annotations.Benchmark; > import org.openjdk.jmh.annotations.Level; > import org.openjdk.jmh.annotations.Param; > import org.openjdk.jmh.annotations.Scope; > import org.openjdk.jmh.annotations.Setup; > import org.openjdk.jmh.annotations.State; > > import java.nio.ByteBuffer; > > @State(Scope.Benchmark) > public class ByteBufferBenchmark { > > @Param({"1", "10", "100", "1000", "10000"}) > public int L; > > @State(Scope.Benchmark) > public static class ByteBufferContainer { > > ByteBuffer bb; > > @Setup(Level.Invocation) > public void initByteBuffer() { > bb = ByteBuffer.allocateDirect(4 * 10000); > > for (int i = 0; i < 10000; i++) { > bb.putInt(i); > } > } > > ByteBuffer getByteBuffer() { > return bb; > } > > } > > @Benchmark > public int benchmark_byte_buffer_put(ByteBufferContainer bbC) { > > ByteBuffer bb = bbC.getByteBuffer(); > > bb.position(0); > > int sum = 0; > > for (int i = 0; i < L; i++) { > sum += bb.getInt(); > } > > return sum; > > } > > } > > Results : > > - Unmodified Build - > > Benchmark (L) Mode Cnt Score > Error Units > ByteBufferBenchmark.benchmark_byte_buffer_put 1 thrpt 200 > 29677205.748 ± 544721.142 ops/s > ByteBufferBenchmark.benchmark_byte_buffer_put 10 thrpt 200 > 18219951.454 ± 320724.793 ops/s > ByteBufferBenchmark.benchmark_byte_buffer_put 100 thrpt 200 > 7767650.826 ± 121798.910 ops/s > ByteBufferBenchmark.benchmark_byte_buffer_put 1000 thrpt 200 > 1646075.010 ± 9804.499 ops/s > ByteBufferBenchmark.benchmark_byte_buffer_put 10000 thrpt 200 > 183489.418 ± 1355.967 ops/s > > - Build With Reference.reachabilityFences Added - > > Benchmark (L) Mode Cnt Score > Error Units Impact > ByteBufferBenchmark.benchmark_byte_buffer_put 1 thrpt 200 > 15230086.695 ± 390174.190 ops/s -48.681% > ByteBufferBenchmark.benchmark_byte_buffer_put 10 thrpt 200 > 8126310.728 ± 123661.342 ops/s -55.399% > ByteBufferBenchmark.benchmark_byte_buffer_put 100 thrpt 200 > 1582699.233 ± 7278.744 ops/s -79.624% > ByteBufferBenchmark.benchmark_byte_buffer_put 1000 thrpt 200 > 179726.465 ± 802.333 ops/s -89.082% > ByteBufferBenchmark.benchmark_byte_buffer_put 10000 thrpt 200 18327.049 > ± 9.506 ops/s -90.012% > > - Build With Reference.reachabilityFences Added And DontInline Replaced > With ForceInline - > > Benchmark (L) Mode Cnt Score > Error Units Impact > ByteBufferBenchmark.benchmark_byte_buffer_put 1 thrpt 200 > 29839190.147 ± 576585.796 ops/s +0.546% > ByteBufferBenchmark.benchmark_byte_buffer_put 10 thrpt 200 > 18397768.759 ± 338144.327 ops/s +0.976% > ByteBufferBenchmark.benchmark_byte_buffer_put 100 thrpt 200 > 7746079.875 ± 101621.105 ops/s -0.278% > ByteBufferBenchmark.benchmark_byte_buffer_put 1000 thrpt 200 > 1629413.444 ± 24163.399 ops/s -1.012% > ByteBufferBenchmark.benchmark_byte_buffer_put 10000 thrpt 200 > 182250.811 ± 2028.461 ops/s -0.675% > > - Build With Reference.reachabilityFences Added And DontInline Removed - > > Benchmark (L) Mode Cnt Score > Error Units Impact > ByteBufferBenchmark.benchmark_byte_buffer_put 1 thrpt 200 > 29442980.464 ± 556324.877 ops/s -0.789% > ByteBufferBenchmark.benchmark_byte_buffer_put 10 thrpt 200 > 18401757.539 ± 419383.901 ops/s +0.998% > ByteBufferBenchmark.benchmark_byte_buffer_put 100 thrpt 200 > 7816766.062 ± 100144.611 ops/s +0.632% > ByteBufferBenchmark.benchmark_byte_buffer_put 1000 thrpt 200 > 1636811.564 ± 13811.447 ops/s -0.563% > ByteBufferBenchmark.benchmark_byte_buffer_put 10000 thrpt 200 > 183463.292 ± 2056.016 ops/s -0.014% > > > Regards, > Ben > > > > From: Paul Sandoz <paul.san...@oracle.com> > To: Ben Walsh <ben_wa...@uk.ibm.com> > Cc: core-libs-dev <core-libs-dev@openjdk.java.net> > Date: 08/02/2018 16:54 > Subject: Re: [PATCH] Reduce Chance Of Mistakenly Early Backing > Memory Cleanup > > > > Hi Ben, > > Thanks. I anticipated a performance hit but not necessarily a 10x. Without > looking at the generated code of the benchmark method it is hard to be > sure [*], but i believe the fence is interfering with loop unrolling > and/or vectorization, the comparative differences between byte and int may > be related to vectorization (for byte there may be less or limited support > for vectorization). > > How about we now try another experiment commenting out the @DontInline on > the fence method and re-run the benchmarks. From Peter’s observations and > Vladimir’s analysis we should be able to remove that, or even, contrary to > what we initial expected when adding this feature, change to @ForceInline! > > Thanks, > Paul. > > [*] If you are running on linux you can use the excellent JMH perfasm > feature to dump the hot parts of HotSpots generated code. > >> On Feb 8, 2018, at 8:22 AM, Ben Walsh <ben_wa...@uk.ibm.com> wrote: >> >> Hi Paul, >> >> Following up with the requested loop and vectorization benchmarks ... >> >> >> (Do the vectorization benchmark results imply that the Hotspot compiler >> has been unable to perform the vectorization optimisation due to the >> presence of the reachabilityFence ?) >> >> >> > ----------------------------------------------------------------------------------------------------------------------- >> >> >> Loop Benchmarking >> ---- ------------ >> >> package org.sample; >> >> import org.openjdk.jmh.annotations.Benchmark; >> import org.openjdk.jmh.annotations.Level; >> import org.openjdk.jmh.annotations.Param; >> import org.openjdk.jmh.annotations.Scope; >> import org.openjdk.jmh.annotations.Setup; >> import org.openjdk.jmh.annotations.State; >> >> import java.nio.ByteBuffer; >> >> @State(Scope.Benchmark) >> public class ByteBufferBenchmark { >> >> @Param({"1", "10", "100", "1000", "10000"}) >> public int L; >> >> @State(Scope.Benchmark) >> public static class ByteBufferContainer { >> >> ByteBuffer bb; >> >> @Setup(Level.Invocation) >> public void initByteBuffer() { >> bb = ByteBuffer.allocateDirect(10000); >> } >> >> ByteBuffer getByteBuffer() { >> return bb; >> } >> } >> >> @Benchmark >> public ByteBuffer benchmark_byte_buffer_put(ByteBufferContainer bbC) > { >> >> ByteBuffer bb = bbC.getByteBuffer(); >> >> for (int i = 0; i < L; i++) { >> bb.put((byte)i); >> } >> >> return bb; >> } >> >> } >> >> >> Without Changes >> >> Benchmark (L) Mode Cnt Score >> Error Units >> ByteBufferBenchmark.benchmark_byte_buffer_put 1 thrpt 200 >> 29303145.752 ± 635979.750 ops/s >> ByteBufferBenchmark.benchmark_byte_buffer_put 10 thrpt 200 >> 24260859.017 ± 528891.303 ops/s >> ByteBufferBenchmark.benchmark_byte_buffer_put 100 thrpt 200 >> 8512366.637 ± 136615.070 ops/s >> ByteBufferBenchmark.benchmark_byte_buffer_put 1000 thrpt 200 >> 1323756.037 ± 21485.369 ops/s >> ByteBufferBenchmark.benchmark_byte_buffer_put 10000 thrpt 200 >> 145965.305 ± 1301.469 ops/s >> >> >> With Changes >> >> Benchmark (L) Mode Cnt Score >> Error Units Impact >> ByteBufferBenchmark.benchmark_byte_buffer_put 1 thrpt 200 >> 28893540.122 ± 754554.747 ops/s -1.398% >> ByteBufferBenchmark.benchmark_byte_buffer_put 10 thrpt 200 >> 15317696.355 ± 231621.608 ops/s -36.863% >> ByteBufferBenchmark.benchmark_byte_buffer_put 100 thrpt 200 >> 2546599.578 ± 32136.873 ops/s -70.084% >> ByteBufferBenchmark.benchmark_byte_buffer_put 1000 thrpt 200 >> 288832.514 ± 3854.522 ops/s -78.181% >> ByteBufferBenchmark.benchmark_byte_buffer_put 10000 thrpt 200 > 29747.386 >> ± 214.831 ops/s -79.620% >> >> >> > ----------------------------------------------------------------------------------------------------------------------- >> >> >> Vectorization Benchmarking >> ------------- ------------ >> >> package org.sample; >> >> import org.openjdk.jmh.annotations.Benchmark; >> import org.openjdk.jmh.annotations.Level; >> import org.openjdk.jmh.annotations.Param; >> import org.openjdk.jmh.annotations.Scope; >> import org.openjdk.jmh.annotations.Setup; >> import org.openjdk.jmh.annotations.State; >> >> import java.nio.ByteBuffer; >> >> @State(Scope.Benchmark) >> public class ByteBufferBenchmark { >> >> @Param({"1", "10", "100", "1000", "10000"}) >> public int L; >> >> @State(Scope.Benchmark) >> public static class ByteBufferContainer { >> >> ByteBuffer bb; >> >> @Setup(Level.Invocation) >> public void initByteBuffer() { >> bb = ByteBuffer.allocateDirect(4 * 10000); >> >> for (int i = 0; i < 10000; i++) { >> bb.putInt(i); >> } >> } >> >> ByteBuffer getByteBuffer() { >> return bb; >> } >> >> } >> >> @Benchmark >> public int benchmark_byte_buffer_put(ByteBufferContainer bbC) { >> >> ByteBuffer bb = bbC.getByteBuffer(); >> >> bb.position(0); >> >> int sum = 0; >> >> for (int i = 0; i < L; i++) { >> sum += bb.getInt(); >> } >> >> return sum; >> >> } >> >> } >> >> >> Without Changes >> >> Benchmark (L) Mode Cnt Score >> Error Units >> ByteBufferBenchmark.benchmark_byte_buffer_put 1 thrpt 200 >> 29677205.748 ± 544721.142 ops/s >> ByteBufferBenchmark.benchmark_byte_buffer_put 10 thrpt 200 >> 18219951.454 ± 320724.793 ops/s >> ByteBufferBenchmark.benchmark_byte_buffer_put 100 thrpt 200 >> 7767650.826 ± 121798.910 ops/s >> ByteBufferBenchmark.benchmark_byte_buffer_put 1000 thrpt 200 >> 1646075.010 ± 9804.499 ops/s >> ByteBufferBenchmark.benchmark_byte_buffer_put 10000 thrpt 200 >> 183489.418 ± 1355.967 ops/s >> >> >> With Changes >> >> Benchmark (L) Mode Cnt Score >> Error Units Impact >> ByteBufferBenchmark.benchmark_byte_buffer_put 1 thrpt 200 >> 15230086.695 ± 390174.190 ops/s -48.681% >> ByteBufferBenchmark.benchmark_byte_buffer_put 10 thrpt 200 >> 8126310.728 ± 123661.342 ops/s -55.399% >> ByteBufferBenchmark.benchmark_byte_buffer_put 100 thrpt 200 >> 1582699.233 ± 7278.744 ops/s -79.624% >> ByteBufferBenchmark.benchmark_byte_buffer_put 1000 thrpt 200 >> 179726.465 ± 802.333 ops/s -89.082% >> ByteBufferBenchmark.benchmark_byte_buffer_put 10000 thrpt 200 > 18327.049 >> ± 9.506 ops/s -90.012% >> >> >> >> NB : For reference - for this and previous benchmarking results ... >> >> "Without Changes" and "With Changes" - java -version ... >> >> openjdk version "10-internal" 2018-03-20 >> OpenJDK Runtime Environment (build 10-internal+0-adhoc.walshbp.jdk) >> OpenJDK 64-Bit Server VM (build 10-internal+0-adhoc.walshbp.jdk, mixed >> mode) >> >> >> > ----------------------------------------------------------------------------------------------------------------------- >> >> >> Regards, >> Ben Walsh >> > > > > Unless stated otherwise above: > IBM United Kingdom Limited - Registered in England and Wales with number > 741598. > Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU >