Hi Laurent,
Could you disable tiered compilation for performance tests? Tiered
compilation is usually a source of jitter in the results. Pass
-XX:-TieredCompilation to the VM.
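A benchmark harness could also verify at startup that the flag was really passed. The following is only a sketch under assumptions: the class `TieredCheck` and the helper `isTieredDisabled` are hypothetical names, not part of the microbench classes; only `ManagementFactory.getRuntimeMXBean().getInputArguments()` is a standard API.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class TieredCheck {
    /** Returns true if the given JVM arguments explicitly disable tiered compilation. */
    static boolean isTieredDisabled(List<String> jvmArgs) {
        boolean disabled = false;
        for (String arg : jvmArgs) {
            // Assumption: when the flag is repeated, a later occurrence overrides an earlier one.
            if (arg.equals("-XX:-TieredCompilation")) {
                disabled = true;
            } else if (arg.equals("-XX:+TieredCompilation")) {
                disabled = false;
            }
        }
        return disabled;
    }

    public static void main(String[] args) {
        // Arguments passed to the running JVM (not the program arguments).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (!isTieredDisabled(jvmArgs)) {
            System.err.println("Warning: tiered compilation not disabled; pass -XX:-TieredCompilation");
        }
    }
}
```

The check only looks at the command line, so it says nothing about the JVM's default when the flag is absent.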
Regards, Peter
On 04/10/2013 10:58 AM, Laurent Bourgès wrote:
Dear Jim,
2013/4/9 Jim Graham <james.gra...@oracle.com>
The allocations will always show up on a heap profiler, I don't
know of any way of having them not show up if they are stack
allocated, but I don't think that stack allocation is the issue
here - small allocations come out of a fast generation that costs
almost nothing to allocate from and nearly nothing to clean up.
They are actually getting allocated and GC'd, but the process is
optimized.
The only way to tell is to benchmark and see which changes make a
difference and which are in the noise (or, in some odd
counter-intuitive cases, counter-productive)...
...jim
I admit that I like GC because it spares Java code from dealing with
pointers the way C/C++ does; however, I would rather have the GC clean
up real garbage (application...) than wasted memory:
I prefer not to count on the GC when I can avoid wasting memory that
only gives it more work, i.e. reduce useless garbage (save the planet)!
Moreover, GC and/or thread-local allocation (TLAB) seems to have
more overhead than your description suggests: "fast generation that
costs almost nothing to allocate from and nearly nothing to clean up".
Here are my micro-benchmark results related to int[4] allocation where
I mimic the AAShapePipe.fillParallelogram() method:
Threads  Patch (ns/op)  Ref (ns/op)  Gain (Ref/Patch)
1         5,96           8,27        138,76%
2         7,31          14,96        204,65%
3        10,65          20,40        191,55%
4        15,44          29,83        193,20%
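The Gain column is consistent with the Ref time divided by the Patch time, expressed as a percentage. A small sketch that reproduces it from the two timing columns (the class and method names are hypothetical, and the formula is inferred from the numbers, not taken from the benchmark code):

```java
class GainTable {
    /** Gain in percent, assumed here to be reference time divided by patched time. */
    static double gainPercent(double patchNsPerOp, double refNsPerOp) {
        return 100.0 * refNsPerOp / patchNsPerOp;
    }

    public static void main(String[] args) {
        // Timings in ns/op copied from the table, one entry per row.
        double[] patch = {5.96, 7.31, 10.65, 15.44};
        double[] ref = {8.27, 14.96, 20.40, 29.83};
        for (int i = 0; i < patch.length; i++) {
            System.out.printf("row %d: gain = %.2f%%%n", i + 1, gainPercent(patch[i], ref[i]));
        }
    }
}
```

Read this way, a gain of 200% means the reference code needs twice as long per operation as the patched code.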
Test environment:
Linux64 with OpenJDK8 (2 real cpu cores, 4 virtual cpus)
JVM settings:
-XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -Xms128m -Xmx128m
Benchmark code (using Peter Levart microbench classes):
http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/
My conclusion is: "nothing" > zero (allocation + cleanup are not
free), and it is very noticeable in the multi-threaded tests.
I admit that I use a dirty int[4] array (no cleanup, as it is not
necessary): maybe the performance gain comes from that.
Finally here is the output with -XX:+PrintTLAB flag:
TLAB: gc thread: 0x00007f105813d000 [id: 4053] desired_size: 1312KB
    slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB
    refills: 20 waste 1,2% gc: 323712B slow: 600B fast: 0B
TLAB: gc thread: 0x00007f105813a800 [id: 4052] desired_size: 1312KB
    slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB
    refills: 7 waste 7,9% gc: 745568B slow: 176B fast: 0B
TLAB: gc thread: 0x00007f1058138800 [id: 4051] desired_size: 1312KB
    slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB
    refills: 15 waste 3,1% gc: 618464B slow: 448B fast: 0B
TLAB: gc thread: 0x00007f1058136800 [id: 4050] desired_size: 1312KB
    slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB
    refills: 7 waste 0,0% gc: 0B slow: 232B fast: 0B
TLAB: gc thread: 0x00007f1058009000 [id: 4037] desired_size: 1312KB
    slow allocs: 0 refill waste: 20992B alloc: 1,00000 65600KB
    refills: 1 waste 27,5% gc: 369088B slow: 0B fast: 0B
TLAB totals: thrds: 5 refills: 50 max: 20 slow allocs: 0 max 0
    waste: 3,1% gc: 2056832B max: 745568B slow: 1456B max: 600B
    fast: 0B max: 0B
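As a sanity check when reading this log, the "TLAB totals" line can be cross-checked against the five per-thread records: the refills, gc and slow figures are plain sums. A minimal sketch (the class name is hypothetical and the values are copied by hand from the log above):

```java
class TlabTotals {
    /** Sums the per-thread values of one TLAB statistic. */
    static long sum(long... values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Per-thread values in order of appearance in the log (thread ids 4053, 4052, 4051, 4050, 4037).
        long refills = sum(20, 7, 15, 7, 1);
        long gcBytes = sum(323712, 745568, 618464, 0, 369088);
        long slowBytes = sum(600, 176, 448, 232, 0);
        // prints "refills=50 gc=2056832B slow=1456B", matching the totals line
        System.out.println("refills=" + refills + " gc=" + gcBytes + "B slow=" + slowBytes + "B");
    }
}
```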
I would have expected TLAB to recycle all the useless int[4] arrays
as fast as possible => waste = 100% ???
*Is there any bug in TLAB (core-libs)?
Should I report this issue to the hotspot team?*
*Test using ThreadLocal AAShapePipeContext:*
{
    AAShapePipeContext ctx = getThreadContext();
    int[] abox = ctx.abox;

    // use array:
    // mimic: AATileGenerator aatg =
    //     renderengine.getAATileGenerator(x, y, dx1, dy1, dx2, dy2, 0, 0,
    //                                     clip, abox);
    abox[0] = 7;
    abox[1] = 11;
    abox[2] = 13;
    abox[3] = 17;

    // mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg, abox);
    devNull1.yield(abox);

    if (!useThreadLocal) {
        restoreContext(ctx);
    }
}
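getThreadContext() itself is not shown in this mail. A minimal sketch of how such a per-thread context could be obtained, assuming a plain java.lang.ThreadLocal, might look as follows. The class AAShapePipeContextSketch and its nested Context type are hypothetical stand-ins, not the actual AAShapePipeContext from the patch, and the non-thread-local path (restoreContext) is not covered.

```java
class AAShapePipeContextSketch {
    /** Hypothetical per-thread context holding the reusable bounding-box array. */
    static final class Context {
        final int[] abox = new int[4];
    }

    // One Context per thread, created lazily on the first get() in that thread.
    private static final ThreadLocal<Context> CONTEXT = ThreadLocal.withInitial(Context::new);

    static Context getThreadContext() {
        return CONTEXT.get();
    }

    public static void main(String[] args) {
        int[] first = getThreadContext().abox;
        first[0] = 7;
        int[] second = getThreadContext().abox;
        // The same thread gets the same (dirty) array back: no allocation, no cleanup.
        System.out.println(first == second && second[0] == 7);
    }
}
```

The array is handed back dirty, which is safe here only because the caller overwrites all four elements before reading them.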
-XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags
-XX:-PrintFlagsFinal -XX:+UseCompressedKlassPointers
-XX:+UseCompressedOops -XX:+UseParallelGC
>> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24]
#-------------------------------------------------------------
# ContextGetInt4: run duration: 10 000 ms
#
# Warm up:
# 4 threads, Tavg = 13,84 ns/op (σ = 0,23 ns/op), Total ops = 2889056179
    [13,93 (717199825), 13,87 (720665624), 13,48 (741390545), 14,09 (709800185)]
# 4 threads, Tavg = 14,25 ns/op (σ = 0,57 ns/op), Total ops = 2811615084
    [15,21 (658351236), 14,18 (706254551), 13,94 (718202949), 13,74 (728806348)]
cleanup (explicit Full GC) ...
cleanup done.
# Measure:
1 threads, Tavg = 5,96 ns/op (σ = 0,00 ns/op), Total ops = 1678357614
    [5,96 (1678357614)]
2 threads, Tavg = 7,33 ns/op (σ = 0,03 ns/op), Total ops = 2729723450
    [7,31 (1369694121), 7,36 (1360029329)]
3 threads, Tavg = 10,65 ns/op (σ = 2,73 ns/op), Total ops = 2817154340
    [13,24 (755190111), 13,23 (755920429), 7,66 (1306043800)]
4 threads, Tavg = 15,44 ns/op (σ = 3,33 ns/op), Total ops = 2589897733
    [17,05 (586353618), 19,23 (519345153), 17,88 (559401974), 10,81 (924796988)]
#
<< JVM END
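Note that the Tavg figures above do not look like a plain mean of the per-thread values (for 3 threads that mean would be about 11,38, not 10,65). They are consistent with total thread time divided by total operations, i.e. threads x run duration / total ops. This is an inference from the numbers, not taken from the microbench sources; the class and method names below are hypothetical.

```java
class TavgCheck {
    /** Average time per operation in ns, assuming every thread runs for the full duration. */
    static double tavgNsPerOp(int threads, long runMillis, long totalOps) {
        double totalThreadNanos = threads * (runMillis * 1_000_000.0);
        return totalThreadNanos / totalOps;
    }

    public static void main(String[] args) {
        long runMillis = 10_000; // "run duration: 10 000 ms"
        // Total ops copied from the Measure section of the ContextGetInt4 run.
        long[] totalOps = {1678357614L, 2729723450L, 2817154340L, 2589897733L};
        for (int i = 0; i < totalOps.length; i++) {
            int threads = i + 1;
            System.out.printf("%d threads: Tavg = %.2f ns/op%n",
                    threads, tavgNsPerOp(threads, runMillis, totalOps[i]));
        }
    }
}
```

Under this reading, a single fast thread (such as the 7,66 ns/op one in the 3-thread run) pulls Tavg down more than an unweighted mean would.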
*Test using standard int[4] allocation:*
{
    int[] abox = new int[4];

    // use array:
    // mimic: AATileGenerator aatg =
    //     renderengine.getAATileGenerator(x, y, dx1, dy1, dx2, dy2, 0, 0,
    //                                     clip, abox);
    abox[0] = 7;
    abox[1] = 11;
    abox[2] = 13;
    abox[3] = 17;

    // mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg, abox);
    devNull1.yield(abox);
}
-XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags
-XX:-PrintFlagsFinal -XX:+UseCompressedKlassPointers
-XX:+UseCompressedOops -XX:+UseParallelGC
>> JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24]
#-------------------------------------------------------------
# GetInt4: run duration: 10 000 ms
#
# Warm up:
# 4 threads, Tavg = 31,07 ns/op (σ = 0,60 ns/op), Total ops = 1287292142
    [30,26 (330475567), 31,92 (313328449), 31,27 (319805520), 30,89 (323682606)]
# 4 threads, Tavg = 30,94 ns/op (σ = 0,33 ns/op), Total ops = 1293000783
    [30,92 (323382193), 30,61 (326730340), 31,48 (317621402), 30,74 (325266848)]
cleanup (explicit Full GC) ...
cleanup done.
# Measure:
1 threads, Tavg = 8,27 ns/op (σ = 0,00 ns/op), Total ops = 1209213909
    [8,27 (1209213909)]
2 threads, Tavg = 14,96 ns/op (σ = 0,04 ns/op), Total ops = 1337024734
    [15,00 (666659967), 14,92 (670364767)]
3 threads, Tavg = 20,40 ns/op (σ = 1,03 ns/op), Total ops = 1470560922
    [21,21 (471592958), 19,00 (526302911), 21,16 (472665053)]
4 threads, Tavg = 29,83 ns/op (σ = 1,82 ns/op), Total ops = 1340065128
    [31,17 (320806983), 31,58 (316358130), 26,94 (370806790), 30,11 (332093225)]
#
<< JVM END
Best regards,
Laurent