Peter, 1/ I modified your TestRunner class to print total ops and perform explicit GC before runTests: http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/
2/ I applied your advice but it does not change much: -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728 -XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC >> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24] #------------------------------------------------------------- # ContextGetInt4: run duration: 10 000 ms # # Warm up: # 4 threads, Tavg = 13,84 ns/op (σ = 0,23 ns/op), Total ops = 2889056179 [ 13,93 (717199825), 13,87 (720665624), 13,48 (741390545), 14,09 (709800185)] # 4 threads, Tavg = 14,25 ns/op (σ = 0,57 ns/op), Total ops = 2811615084 [ 15,21 (658351236), 14,18 (706254551), 13,94 (718202949), 13,74 (728806348)] cleanup (explicit Full GC) ... cleanup done. # Measure: 1 threads, Tavg = 5,96 ns/op (σ = 0,00 ns/op), Total ops = 1678357614 [ 5,96 (1678357614)] 2 threads, Tavg = 7,33 ns/op (σ = 0,03 ns/op), Total ops = 2729723450 [ 7,31 (1369694121), 7,36 (1360029329)] 3 threads, Tavg = 10,65 ns/op (σ = 2,73 ns/op), Total ops = 2817154340 [ 13,24 (755190111), 13,23 (755920429), 7,66 (1306043800)] 4 threads, Tavg = 15,44 ns/op (σ = 3,33 ns/op), Total ops = 2589897733 [ 17,05 (586353618), 19,23 (519345153), 17,88 (559401974), 10,81 (924796988)] -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728 -XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -XX:-TieredCompilation -XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC >> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24] #------------------------------------------------------------- # GetInt4: run duration: 10 000 ms # # Warm up: # 4 threads, Tavg = 31,56 ns/op (σ = 0,43 ns/op), Total ops = 1267295706 [ 31,30 (319512554), 31,02 (322293333), 32,12 (311334550), 31,82 (314155269)] # 4 threads, Tavg = 30,75 ns/op (σ = 1,81 ns/op), Total ops = 1302123211 [ 32,21 (310949394), 32,37 (309275124), 27,87 (359125007), 31,01 (322773686)] cleanup (explicit Full GC) ... cleanup done. # Measure: 1 threads, Tavg = 8,36 ns/op (σ = 0,00 ns/op), Total ops = 1196238323 [ 8,36 (1196238323)] 2 threads, Tavg = 14,95 ns/op (σ = 0,04 ns/op), Total ops = 1337648720 [ 15,00 (666813210), 14,91 (670835510)] 3 threads, Tavg = 20,65 ns/op (σ = 0,99 ns/op), Total ops = 1453119707 [ 19,57 (511100480), 21,97 (455262170), 20,54 (486757057)] 4 threads, Tavg = 30,76 ns/op (σ = 0,54 ns/op), Total ops = 1301090278 [ 31,51 (317527231), 30,79 (324921525), 30,78 (325063322), 29,99 (333578200)] # << JVM END 3/ I tried several heap settings: without Xms/Xmx ... but it has almost no effect. *Should I play with TLAB resize / initial size ? or different GC collector (G1 ...) ? Does anybody can explain me what PrintTLAB mean ?* Laurent 2013/4/10 Peter Levart <peter.lev...@gmail.com> > Hi Laurent, > > Could you disable tiered compilation for performance tests? Tiered > compilation is usually a source of jitter in the results. Pass > -XX:-TieredCompilation to the VM. > > Regards, Peter > > > > On 04/10/2013 10:58 AM, Laurent Bourgès wrote: > > Dear Jim, > > 2013/4/9 Jim Graham <james.gra...@oracle.com> > >> >> The allocations will always show up on a heap profiler, I don't know of >> any way of having them not show up if they are stack allocated, but I don't >> think that stack allocation is the issue here - small allocations come out >> of a fast generation that costs almost nothing to allocate from and nearly >> nothing to clean up. They are actually getting allocated and GC'd, but the >> process is optimized. >> >> The only way to tell is to benchmark and see which changes make a >> difference and which are in the noise (or, in some odd counter-intuitive >> cases, counter-productive)... >> >> ...jim >> > > I advocate I like GC because it avoids in Java dealing with pointers like > C/C++ does; however, I prefer GC clean real garbage (application...) than > wasted memory: > I prefer not count on GC when I can avoid wasting memory that gives GC > more work = reduce useless garbage (save the planet) ! > > Moreover, GC and / or Thread local allocation (TLAB) seems to have more > overhead than you think = "fast generation that costs almost nothing to > allocate from and nearly nothing to clean up". > > Here are my micro-benchmark results related to int[4] allocation where I > mimic the AAShapePipe.fillParallelogram() method: > Patch Ref Gain 5,96 8,27 138,76% 7,31 14,96 204,65% 10,65 20,4 > 191,55% 15,44 29,83 193,20% > Test environment: > Linux64 with OpenJDK8 (2 real cpu cores, 4 virtual cpus) > JVM settings: > -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -Xms128m -Xmx128m > > Benchmark code (using Peter Levart microbench classes): > http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/ > > My conclusion is: "nothing" > zero (allocation + cleanup) and it is very > noticeable in multi threading tests. > > I advocate that I use a dirty int[4] array (no cleanup) but it is not > necessary : maybe the performance gain come from that reason. > > > Finally here is the output with -XX:+PrintTLAB flag: > TLAB: gc thread: 0x00007f105813d000 [id: 4053] desired_size: 1312KB slow > allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 20 > waste 1,2% gc: 323712B slow: 600B fast: 0B > TLAB: gc thread: 0x00007f105813a800 [id: 4052] desired_size: 1312KB slow > allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 7 waste > 7,9% gc: 745568B slow: 176B fast: 0B > TLAB: gc thread: 0x00007f1058138800 [id: 4051] desired_size: 1312KB slow > allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 15 > waste 3,1% gc: 618464B slow: 448B fast: 0B > TLAB: gc thread: 0x00007f1058136800 [id: 4050] desired_size: 1312KB slow > allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 7 waste > 0,0% gc: 0B slow: 232B fast: 0B > TLAB: gc thread: 0x00007f1058009000 [id: 4037] desired_size: 1312KB slow > allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB refills: 1 waste > 27,5% gc: 369088B slow: 0B fast: 0B > TLAB totals: thrds: 5 refills: 50 max: 20 slow allocs: 0 max 0 waste: > 3,1% gc: 2056832B max: 745568B slow: 1456B max: 600B fast: 0B max: 0B > > I would have expected that TLAB can recycle all useless int[4] arrays as > fast as possible => waste = 100% ??? > > *Is there any bug in TLAB (core-libs) ? > Should I send such issue to hotspot team ? > * > > *Test using ThreadLocal AAShapePipeContext:* > { > AAShapePipeContext ctx = getThreadContext(); > int abox[] = ctx.abox; > > // use array: > // mimic: AATileGenerator aatg = renderengine.getAATileGenerator(x, y, > dx1, dy1, dx2, dy2, 0, 0, clip, abox); > abox[0] = 7; > abox[1] = 11; > abox[2] = 13; > abox[3] = 17; > > // mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg, abox); > devNull1.yield(abox); > > if (!useThreadLocal) { > restoreContext(ctx); > } > } > > -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728 > -XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal > -XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC > >> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24] > #------------------------------------------------------------- > # ContextGetInt4: run duration: 10 000 ms > # > # Warm up: > # 4 threads, Tavg = 13,84 ns/op (σ = 0,23 ns/op), Total > ops = 2889056179 [ 13,93 (717199825), 13,87 (720665624), 13,48 > (741390545), 14,09 (709800185)] > # 4 threads, Tavg = 14,25 ns/op (σ = 0,57 ns/op), Total > ops = 2811615084 [ 15,21 (658351236), 14,18 (706254551), 13,94 > (718202949), 13,74 (728806348)] > cleanup (explicit Full GC) ... > cleanup done. > # Measure: > *1 threads, Tavg = 5,96 ns/op (σ = 0,00 ns/op), Total ops = > 1678357614 [ 5,96 (1678357614)] > 2 threads, Tavg = 7,33 ns/op (σ = 0,03 ns/op), Total ops = > 2729723450 [ 7,31 (1369694121), 7,36 (1360029329)] > 3 threads, Tavg = 10,65 ns/op (σ = 2,73 ns/op), Total ops = > 2817154340 [ 13,24 (755190111), 13,23 (755920429), 7,66 > (1306043800)] > **4 threads, Tavg = 15,44 ns/op (σ = 3,33 ns/op), Total ops = > 2589897733 [ 17,05 (586353618), 19,23 (519345153), 17,88 > (559401974), 10,81 *(924796988)] > # > << JVM END > > *Test using standard int[4] allocation:* > { > int abox[] = new int[4]; > > // use array: > // mimic: AATileGenerator aatg = renderengine.getAATileGenerator(x, y, > dx1, dy1, dx2, dy2, 0, 0, clip, abox); > abox[0] = 7; > abox[1] = 11; > abox[2] = 13; > abox[3] = 17; > > // mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg, abox); > devNull1.yield(abox); > } > > -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728 > -XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal > -XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC > >> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24] > #------------------------------------------------------------- > # GetInt4: run duration: 10 000 ms > # > # Warm up: > # 4 threads, Tavg = 31,07 ns/op (σ = 0,60 ns/op), Total > ops = 1287292142 [ 30,26 (330475567), 31,92 (313328449), 31,27 > (319805520), 30,89 (323682606)] > # 4 threads, Tavg = 30,94 ns/op (σ = 0,33 ns/op), Total > ops = 1293000783 [ 30,92 (323382193), 30,61 (326730340), 31,48 > (317621402), 30,74 (325266848)] > cleanup (explicit Full GC) ... > cleanup done. > # Measure: > *1 threads, Tavg = 8,27 ns/op (σ = 0,00 ns/op), Total ops = > 1209213909 [ 8,27 (1209213909)] > 2 threads, Tavg = 14,96 ns/op (σ = 0,04 ns/op), Total ops = > 1337024734 [ 15,00 (666659967), 14,92 (670364767)] > 3 threads, Tavg = 20,40 ns/op (σ = 1,03 ns/op), Total ops = > 1470560922 [ 21,21 (471592958), 19,00 (526302911), 21,16 > (472665053)] > **4 threads, Tavg = 29,83 ns/op (σ = 1,82 ns/op), Total ops = > 1340065128 [ 31,17 (320806983), 31,58 (316358130), 26,94 > (370806790), 30,11 *(332093225)] > # > << JVM END > > Best regards, > Laurent > > >