Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-11 Thread Laurent Bourgès
Jim and Sergey,

1/ Here are a few benchmarks (based on mapBench again) running several
modified versions of AAShapePipe:
http://jmmc.fr/~bourgesl/share/AAShapePipe/mapBench/
- ref:
1 threads and 20 loops per thread, time: 3742 ms
2 threads and 20 loops per thread, time: 4756 ms
4 threads and 20 loops per thread, time: 8528 ms

1 threads and 20 loops per thread, time: 56264 ms
2 threads and 20 loops per thread, time: 75566 ms
4 threads and 20 loops per thread, time: 141546 ms

- int4:
1 threads and 20 loops per thread, time: 3653 ms
2 threads and 20 loops per thread, time: 4684 ms
4 threads and 20 loops per thread, time: 8291 ms

1 threads and 20 loops per thread, time: 55950 ms
2 threads and 20 loops per thread, time: 74796 ms
4 threads and 20 loops per thread, time: 139924 ms

- byte[]:
1 threads and 20 loops per thread, time: 3795 ms
2 threads and 20 loops per thread, time: 4605 ms
4 threads and 20 loops per thread, time: 8246 ms

1 threads and 20 loops per thread, time: 54961 ms
2 threads and 20 loops per thread, time: 72768 ms
4 threads and 20 loops per thread, time: 139430 ms

- int4 / byte[] / rectangle cached in TileState:
1 threads and 20 loops per thread, time: 3610 ms
2 threads and 20 loops per thread, time: 4481 ms
4 threads and 20 loops per thread, time: 8225 ms

1 threads and 20 loops per thread, time: 54651 ms
2 threads and 20 loops per thread, time: 74516 ms
4 threads and 20 loops per thread, time: 140153 ms

So you may be right: the results vary depending on the optimizations
(int4, byte[] or all)!
Maybe I should test the different versions on the optimized Pisces renderer ...

Here is an updated patch:
http://jmmc.fr/~bourgesl/share/AAShapePipe/webrev-2/


2/ Thanks for your comments: actually, a refactoring is possible to use a
(shared) TileState instance (replacing int[] bbox, rectangle bbox):
- RenderingEngine.AATileGenerator getAATileGenerator(... int[] abox)

It would be very interesting here to propose an extensible tile state, maybe
created by the rendering engine to cache other data?

- Rectangle and Rectangle2D are only used as the shape s and device
rectangle given to CompositePipe.startSequence():
public Object startSequence(SunGraphics2D sg, Shape s, Rectangle dev,
int[] abox);

Changing this interface may become difficult:
AlphaColorPipe.java:
41:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle dev,
OK, [s, dev, abox] unused

AlphaPaintPipe.java
81:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle devR,
create a paint context:
PaintContext paintContext =
sg.paint.createContext(sg.getDeviceColorModel(),
   devR,
   s.getBounds2D(),
   sg.cloneTransform(),
   sg.getRenderingHints());

GeneralCompositePipe.java:
62:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle devR,
abox unused and create a paint context:
PaintContext paintContext =
sg.paint.createContext(model, devR, s.getBounds2D(),
   sg.cloneTransform(),
   hints);

SpanClipRenderer.java
68:  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle devR,
Forward to another composite pipe
return new SCRcontext(ri, outpipe.startSequence(sg, s, devR, abox));

It could be possible to use TileState in the PaintContext interface and fix the
implementations, but it may become a tricky change (API change).

What do you think ?

Laurent

2013/4/11 Jim Graham james.gra...@oracle.com

 I'm pretty familiar with all of this code and there aren't any places that
 save the tile array that I remember.  The embedded code that Pisces was
 taken from had some caching of alpha arrays, but we didn't use or keep that
 when we converted it for use in the JDK...

 It occurs to me that since you are collecting the various pieces of
 information into an object to store in the thread local storage, perhaps we
 should convert to a paradigm where an entire Tile Generation sequence uses
 that object (TileState?) as its main way to communicate info around the
 various stages.  Thus, you don't really need an int[4] to store the 4
 parameters, they could be stored directly in the TileState object. This
 would require more sweeping changes to the pipeline, but it might make the
 code a bit more readable (and make the hits to convert over more moot as
 they would be improving readability and give more focus to the
 relationships between all of the various bits of data).  Then it simply
 becomes a matter of managing the lifetime and allocation of the TileState
 objects which is a minor update to the newly refactored code.

 ...jim
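
To make the TileState idea above concrete, here is a minimal Java sketch. It is
only an illustration under stated assumptions: the field names (x0, y0, x1, y1),
the alphaTile() helper and the grow-on-demand policy are hypothetical and are
not taken from the webrev or from the JDK sources.

```java
/** Hypothetical TileState: names and methods are assumptions, not the patch's API. */
final class TileState {
    // Bounding box stored directly as fields instead of an int[4].
    int x0, y0, x1, y1;

    // Reusable alpha tile buffer, grown on demand and never shrunk.
    private byte[] alpha = new byte[0];

    /** Records the bounds of the shape being rendered (x1, y1 exclusive). */
    void setBounds(int x0, int y0, int x1, int y1) {
        this.x0 = x0;
        this.y0 = y0;
        this.x1 = x1;
        this.y1 = y1;
    }

    int width() {
        return x1 - x0;
    }

    int height() {
        return y1 - y0;
    }

    /** Returns a buffer with room for at least tileW * tileH alpha values. */
    byte[] alphaTile(int tileW, int tileH) {
        int needed = tileW * tileH;
        if (alpha.length < needed) {
            alpha = new byte[needed]; // allocate only when the cached one is too small
        }
        return alpha;
    }
}
```

With such an object, the stages of a tile generation sequence would pass one
TileState around instead of a separate int[4], byte[] and Rectangle.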

 On 4/10/13 3:59 PM, Sergey Bylokhov wrote:

  On 4/10/13 11:46 PM, Laurent Bourgès wrote:

 I see that some methods which take it as argument doesn't use them. And
 most of the time we pass AATileGenerator and abox[] to the 

Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-11 Thread Laurent Bourgès
Last idea: I will enhance Andrea's mapBench benchmark to report statistics
per thread: number of loops, avg, min, max, stddev.

I guess that the total bench time is not very representative, as the thread
pool can distribute the workload differently in each test: statistics
will help to get better timings and comparisons between bench runs.
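
The per-thread statistics listed above could be computed along these lines.
This is a minimal sketch, not MapBench code: the ThreadStats class and its
input (one elapsed time in ms per loop) are assumptions, and the standard
deviation is the population form.

```java
import java.util.Locale;

/** Hypothetical helper: summary statistics for the loops run by one thread. */
final class ThreadStats {
    final int loops;
    final double avg;
    final double min;
    final double max;
    final double stddev;

    ThreadStats(long[] loopTimesMs) {
        if (loopTimesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        loops = loopTimesMs.length;
        double sum = 0;
        double lo = Double.POSITIVE_INFINITY;
        double hi = Double.NEGATIVE_INFINITY;
        for (long t : loopTimesMs) {
            sum += t;
            lo = Math.min(lo, t);
            hi = Math.max(hi, t);
        }
        avg = sum / loops;
        min = lo;
        max = hi;
        // Second pass: population standard deviation around the mean.
        double squares = 0;
        for (long t : loopTimesMs) {
            double d = t - avg;
            squares += d * d;
        }
        stddev = Math.sqrt(squares / loops);
    }

    @Override
    public String toString() {
        return String.format(Locale.ROOT,
                "loops=%d avg=%.2f ms min=%.2f ms max=%.2f ms stddev=%.2f ms",
                loops, avg, min, max, stddev);
    }
}
```

Each worker thread would record its own loop times and build one ThreadStats
at the end, so that runs can be compared thread by thread.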

Regards,
Laurent


Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-11 Thread Jim Graham

Hi Laurent,

Yes, these kinds of minor optimizations (i.e. optimizations that don't 
make a clear 2x type of savings) can be frustrating at times.  It looks 
like there is potential for a decent return there if we can find the 
right change.  Sometimes rearranging a couple of things that don't look 
like they are saving work can somehow trip the runtime into executing 
the code more efficiently.


I skimmed through your thoughts at the bottom.  It occurred to me after 
I sent that idea out that sometimes we use int[] because we have to hand 
the values to native for return values and there is no easy way to 
return 4 values from a native method.  An array is simplest because it 
can be loaded with answers via a single JNI call.  4 fields in a class 
would require 4xJNI.SetField calls.  It might have better payoff if we 
can cache renderer state there as well which gets into subclassing. 
Also, doing this right may have to be done by someone here at Oracle 
because it may involve modifying the Ductus pipeline to match (it's been 
a while and I don't remember if we open sourced the code that interfaces 
Ductus to the RenderingEngine interfaces...?)


...jim


Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-10 Thread Laurent Bourgès
Andrea,
I am running benchmarks on my laptop (i7 - 2 cores at 2.8 GHz + HT = 4 virtual
cpus) on Linux 64 (Fedora 14).
Note: I always use cpufrequtils to set the cpu governor to performance and
use a fixed frequency = 2.8 GHz:
[bourgesl@jmmc-laurent ~]$ uname -a
Linux jmmc-laurent.obs.ujf-grenoble.fr 2.6.35.14-106.fc14.x86_64 #1 SMP Wed
Nov 23 13:07:52 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

2013/4/10 Andrea Aime andrea.a...@geo-solutions.it

 On Tue, Apr 9, 2013 at 7:34 PM, Laurent Bourgès bourges.laur...@gmail.com
  wrote:

 Also, this should be tested on multiple platforms, preferably Linux,
 Windows and Mac to see how it is affected by differences in the platform
 runtimes and threading (hopefully minimal).

 It appears more difficult for me: I can use at work a mac 10.8 and I can
 run Windows XP within virtual box (but it is not very representative).


 I believe I can run MapBench on my Linux 64bit box during the next
 weekend, that would add a platform, and one where the
 server side behavior is enabled by default. And hopefully run the other
 benchmark as well.


I also run j2DBench, and I can also try Java2D.demos to perform regression
tests.



 Laurent, have you made any changes to MapBench since I've sent it to you?


Yes, I fixed it a bit (cached BasicStroke, reused BufferedImage / Graphics) and
added an explicit GC before tests (same initial conditions):
http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench/

Look at MapBench-src.zip
(http://jmmc.fr/%7Ebourgesl/share/java2d-pisces/MapBench/MapBench-src.zip)
for the test changes.

Thanks for your efforts,
Laurent


Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-10 Thread Laurent Bourgès
Dear Jim,

2013/4/9 Jim Graham james.gra...@oracle.com


 The allocations will always show up on a heap profiler, I don't know of
 any way of having them not show up if they are stack allocated, but I don't
 think that stack allocation is the issue here - small allocations come out
 of a fast generation that costs almost nothing to allocate from and nearly
 nothing to clean up.  They are actually getting allocated and GC'd, but the
 process is optimized.

 The only way to tell is to benchmark and see which changes make a
 difference and which are in the noise (or, in some odd counter-intuitive
 cases, counter-productive)...

 ...jim


I admit I like GC because it spares Java developers from dealing with pointers as
in C/C++; however, I would rather have the GC clean up real garbage (application...)
than wasted memory:
I prefer not to count on the GC when I can avoid wasting memory that gives the GC more
work = reduce useless garbage (save the planet)!

Moreover, GC and/or thread-local allocation (TLAB) seem to have more
overhead than you think ("fast generation that costs almost nothing to
allocate from and nearly nothing to clean up").

Here are my micro-benchmark results related to int[4] allocation where I
mimic the AAShapePipe.fillParallelogram() method:
Patch   Ref     Gain
5,96    8,27    138,76%
7,31    14,96   204,65%
10,65   20,4    191,55%
15,44   29,83   193,20%
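
For reference, the Gain figures are consistent with Ref divided by Patch,
expressed as a percentage. A small sketch reproduces that arithmetic; the Gain
class and the gainPercent() name are hypothetical and only serve to illustrate
the calculation.

```java
/** Hypothetical helper: relative gain of the patched run over the reference run. */
final class Gain {
    /** Returns 100 * ref / patch, so 200 means the patch is twice as fast. */
    static double gainPercent(double patch, double ref) {
        if (patch <= 0) {
            throw new IllegalArgumentException("patch time must be positive");
        }
        return 100.0 * ref / patch;
    }
}
```

For example, gainPercent(7.31, 14.96) gives about 204.65, which matches the
second row.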
Test environment:
Linux64 with OpenJDK8 (2 real cpu cores, 4 virtual cpus)
JVM settings:
-XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -Xms128m  -Xmx128m

Benchmark code (using Peter Levart microbench classes):
http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/

My conclusion is: "almost nothing" is not zero (allocation + cleanup), and it is very
noticeable in multi-threaded tests.

I admit that I use a dirty int[4] array (no cleanup), as cleanup is not
necessary here: maybe the performance gain comes from that.


Finally here is the output with  -XX:+PrintTLAB flag:
TLAB: gc thread: 0x7f105813d000 [id: 4053] desired_size: 1312KB slow allocs: 0  refill waste: 20992B alloc: 1,065600KB refills: 20 waste  1,2% gc: 323712B slow: 600B fast: 0B
TLAB: gc thread: 0x7f105813a800 [id: 4052] desired_size: 1312KB slow allocs: 0  refill waste: 20992B alloc: 1,065600KB refills: 7 waste  7,9% gc: 745568B slow: 176B fast: 0B
TLAB: gc thread: 0x7f1058138800 [id: 4051] desired_size: 1312KB slow allocs: 0  refill waste: 20992B alloc: 1,065600KB refills: 15 waste  3,1% gc: 618464B slow: 448B fast: 0B
TLAB: gc thread: 0x7f1058136800 [id: 4050] desired_size: 1312KB slow allocs: 0  refill waste: 20992B alloc: 1,065600KB refills: 7 waste  0,0% gc: 0B slow: 232B fast: 0B
TLAB: gc thread: 0x7f1058009000 [id: 4037] desired_size: 1312KB slow allocs: 0  refill waste: 20992B alloc: 1,065600KB refills: 1 waste 27,5% gc: 369088B slow: 0B fast: 0B
TLAB totals: thrds: 5  refills: 50 max: 20 slow allocs: 0 max 0 waste:  3,1% gc: 2056832B max: 745568B slow: 1456B max: 600B fast: 0B max: 0B

I would have expected the TLAB to recycle all useless int[4] arrays as
fast as possible = waste = 100% ???

*Is there any bug in TLAB (core-libs)?
Should I report this issue to the hotspot team?
*

*Test using ThreadLocal AAShapePipeContext:*
{
    AAShapePipeContext ctx = getThreadContext();
    int[] abox = ctx.abox;

    // use array:
    // mimic: AATileGenerator aatg = renderengine.getAATileGenerator(x, y,
    //            dx1, dy1, dx2, dy2, 0, 0, clip, abox);
    abox[0] = 7;
    abox[1] = 11;
    abox[2] = 13;
    abox[3] = 17;

    // mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg, abox);
    devNull1.yield(abox);

    if (!useThreadLocal) {
        restoreContext(ctx);
    }
}
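
The getThreadContext() call in the test above is not shown in this mail. A
minimal sketch of what such a per-thread context could look like follows.
Apart from the abox field and the getThreadContext() name, which come from the
snippet, everything here is an assumption, including the use of
ThreadLocal.withInitial (Java 8).

```java
/** Hypothetical per-thread context; only the abox field appears in the test above. */
final class AAShapePipeContext {
    // Reused bounding box array: allocated once per thread, never cleared.
    final int[] abox = new int[4];

    // One context per thread, created lazily on first access.
    private static final ThreadLocal<AAShapePipeContext> CONTEXT =
            ThreadLocal.withInitial(AAShapePipeContext::new);

    private AAShapePipeContext() {
    }

    /** Returns the calling thread's context, creating it on first use. */
    static AAShapePipeContext getThreadContext() {
        return CONTEXT.get();
    }
}
```

Because each thread owns its context, the abox array needs no synchronization,
but its content is "dirty" between calls and must be fully overwritten before
it is read.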

-XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal
-XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
 JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24]
#-
# ContextGetInt4: run duration: 10 000 ms
#
# Warm up:
#   4 threads, Tavg = 13,84 ns/op (σ =   0,23 ns/op), Total ops
=   2889056179 [13,93 (717199825), 13,87 (720665624), 13,48
(741390545), 14,09 (709800185)]
#   4 threads, Tavg = 14,25 ns/op (σ =   0,57 ns/op), Total ops
=   2811615084 [15,21 (658351236), 14,18 (706254551), 13,94
(718202949), 13,74 (728806348)]
cleanup (explicit Full GC) ...
cleanup done.
# Measure:
*1 threads, Tavg =  5,96 ns/op (σ =   0,00 ns/op), Total ops =
1678357614 [ 5,96 (1678357614)]
2 threads, Tavg =  7,33 ns/op (σ =   0,03 ns/op), Total ops =
2729723450 [ 7,31 (1369694121),  7,36 (1360029329)]
3 threads, Tavg = 10,65 ns/op (σ =   2,73 ns/op), Total ops =
2817154340 [13,24 (755190111), 13,23 (755920429),  7,66
(1306043800)]
**4 threads, Tavg = 15,44 ns/op (σ =   3,33 ns/op), Total ops =
2589897733 [17,05 (586353618), 19,23 

Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-10 Thread Peter Levart

Hi Laurent,

Could you disable tiered compilation for performance tests? Tiered 
compilation is usually a source of jitter in the results. Pass 
-XX:-TieredCompilation to the VM.
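
To double check that a benchmark run really has tiered compilation disabled,
the JVM input arguments can be inspected from inside the test. This is a
minimal sketch, assuming the flag is passed on the command line and that the
last occurrence wins when both forms are given; the JvmFlagCheck class is
hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical helper to verify benchmark JVM settings. */
final class JvmFlagCheck {
    /** Returns true if the given arguments disable tiered compilation. */
    static boolean tieredCompilationDisabled(List<String> jvmArgs) {
        boolean disabled = false;
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:-TieredCompilation")) {
                disabled = true;
            } else if (arg.equals("-XX:+TieredCompilation")) {
                disabled = false; // assumption: a later flag overrides an earlier one
            }
        }
        return disabled;
    }

    /** Arguments the current JVM was started with (excluding main() arguments). */
    static List<String> currentJvmArgs() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }
}
```

A test runner could call tieredCompilationDisabled(currentJvmArgs()) at startup
and print a warning when it returns false.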


Regards, Peter



Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-10 Thread Laurent Bourgès
Peter,

1/ I modified your TestRunner class to print total ops and perform explicit
GC before runTests:
http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/

2/ I applied your advice but it does not change much:

 -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal
-XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
  JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM
25.0-b24]
 #-
 # ContextGetInt4: run duration: 10 000 ms
 #
 # Warm up:
 #   4 threads, Tavg = 13,84 ns/op (σ =   0,23
ns/op), Total ops =   2889056179 [13,93 (717199825), 13,87
(720665624), 13,48 (741390545), 14,09 (709800185)]
 #   4 threads, Tavg = 14,25 ns/op (σ =   0,57
ns/op), Total ops =   2811615084 [15,21 (658351236), 14,18
(706254551), 13,94 (718202949), 13,74 (728806348)]
 cleanup (explicit Full GC) ...
 cleanup done.
 # Measure:
 1 threads, Tavg =  5,96 ns/op (σ =   0,00 ns/op), Total
ops =   1678357614 [ 5,96 (1678357614)]
 2 threads, Tavg =  7,33 ns/op (σ =   0,03 ns/op), Total
ops =   2729723450 [ 7,31 (1369694121),  7,36 (1360029329)]
 3 threads, Tavg = 10,65 ns/op (σ =   2,73 ns/op), Total
ops =   2817154340 [13,24 (755190111), 13,23 (755920429),  7,66
(1306043800)]
 4 threads, Tavg = 15,44 ns/op (σ =   3,33 ns/op), Total
ops =   2589897733 [17,05 (586353618), 19,23 (519345153), 17,88
(559401974), 10,81 (924796988)]

 -XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728
-XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal
-XX:-TieredCompilation -XX:+UseCompressedKlassPointers
-XX:+UseCompressedOops -XX:+UseParallelGC
  JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM
25.0-b24]
 #-
 # GetInt4: run duration: 10 000 ms
 #
 # Warm up:
 #   4 threads, Tavg = 31,56 ns/op (σ =   0,43
ns/op), Total ops =   1267295706 [31,30 (319512554), 31,02
(32229), 32,12 (311334550), 31,82 (314155269)]
 #   4 threads, Tavg = 30,75 ns/op (σ =   1,81
ns/op), Total ops =   1302123211 [32,21 (310949394), 32,37
(309275124), 27,87 (359125007), 31,01 (322773686)]
 cleanup (explicit Full GC) ...
 cleanup done.
 # Measure:
 1 threads, Tavg =  8,36 ns/op (σ =   0,00 ns/op), Total
ops =   1196238323 [ 8,36 (1196238323)]
 2 threads, Tavg = 14,95 ns/op (σ =   0,04 ns/op), Total
ops =   1337648720 [15,00 (666813210), 14,91 (670835510)]
 3 threads, Tavg = 20,65 ns/op (σ =   0,99 ns/op), Total
ops =   1453119707 [19,57 (511100480), 21,97 (455262170), 20,54
(486757057)]
 4 threads, Tavg = 30,76 ns/op (σ =   0,54 ns/op), Total
ops =   1301090278 [31,51 (317527231), 30,79 (324921525), 30,78
(325063322), 29,99 (333578200)]
 #
  JVM END

3/ I tried several heap settings (without Xms/Xmx ...), but it has almost no
effect.

*Should I play with the TLAB resize / initial size? Or a different GC collector
(G1 ...)?

Can anybody explain to me what the PrintTLAB output means?*

Laurent


Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-10 Thread Andrea Aime
On Wed, Apr 10, 2013 at 10:14 AM, Laurent Bourgès bourges.laur...@gmail.com
 wrote:

 Andrea,
 I am running benchmarks on my laptop (i7 - 2 core 2.8Ghz + HT = 4 virtual
 cpus) on linux 64 (fedora 14).
 Note: I always use cpufrequtils to set the cpu governor to performance and
 use fixed frequency = 2.8Ghz:
 [bourgesl@jmmc-laurent ~]$ uname -a
 Linux jmmc-laurent.obs.ujf-grenoble.fr 2.6.35.14-106.fc14.x86_64 #1 SMP
 Wed Nov 23 13:07:52 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux


Yes, I did the same when I initially ran MapBench on JDK 7 vs OpenJDK 7
(governor settings wise).

Since you are already running on that platform, maybe I can try to cover
Linux 32 bit instead: I also have a notebook with that setup.


 Laurent, have you made any changes to MapBench since I've sent it to you?


 Yes I fixed a bit (cached BasicStroke, reused BufferedImage / Graphics)
 and added explicit GC before tests (same initial conditions):
 http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench/

 Look at MapBench-src.zip
 (http://jmmc.fr/%7Ebourgesl/share/java2d-pisces/MapBench/MapBench-src.zip)
 for test changes.


Thanks

Cheers
Andrea



Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-10 Thread Sergey Bylokhov

Hi, Laurent.
I am not an expert here but just my 50 cents.
This optimization should take place only if it is really a hotspot. But if
it really is a hotspot, it would probably be better to remove these
array/object allocations altogether and use plain bytes?
I see that some methods which take it as an argument don't use it. And
most of the time we pass AATileGenerator and abox[] to the same methods,
so they could be merged?


Also, I suggest using jmh for Java microbenchmarks.
http://openjdk.java.net/projects/code-tools/jmh
So your test will be:
http://cr.openjdk.java.net/~serb/AAShapePipeBenchmark.java



--
Best regards, Sergey.



Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-10 Thread Laurent Bourgès
Sergey,

I am not an expert here, but here are my 50 cents.
 This optimization should be done only if this really is a hotspot. But if it
 really is a hotspot, it would probably be better to remove these
 array/object allocations altogether and use plain bytes?


Java2D calls AAShapePipe for each shape (line, rectangle ...) it renders, so
it is a hotspot for me for big drawings, as its cost depends on the drawing
complexity (for example, Andrea's MapBench can produce maps having more than
100 000 shapes per image ...)


 I see that some methods which take them as arguments don't use them. And
 most of the time we pass AATileGenerator and abox[] to the same methods, so
 they could be merged?


For now I did not want to modify the AAShapePipe signatures: abox[] is
filled by AATileGenerator implementations (ductus, pisces, others) in order
to have the shape bounds and render only tiles covering this area.
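
To illustrate how a bounding box filled by the tile generator limits the work to the tiles covering the shape, here is a minimal sketch. It assumes abox holds {x0, y0, x1, y1} and that tiles have a fixed width and height; the class and method names (TileWalk, tilesCovering) are hypothetical and this is not the JDK's AAShapePipe code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the JDK.
class TileWalk {
    /**
     * Returns one {x, y, w, h} entry per tile needed to cover the box
     * abox = {x0, y0, x1, y1} (assumed layout). Edge tiles are clipped
     * to the box.
     */
    static List<int[]> tilesCovering(int[] abox, int tileW, int tileH) {
        List<int[]> tiles = new ArrayList<>();
        for (int y = abox[1]; y < abox[3]; y += tileH) {
            for (int x = abox[0]; x < abox[2]; x += tileW) {
                int w = Math.min(tileW, abox[2] - x);
                int h = Math.min(tileH, abox[3] - y);
                tiles.add(new int[] {x, y, w, h});
            }
        }
        return tiles;
    }
}
```

Only the area inside the box is visited, so a small shape in a large image touches only a handful of tiles.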



 Also, I suggest using jmh for Java microbenchmarks.
 http://openjdk.java.net/projects/code-tools/jmh
 So your test will be:
 http://cr.openjdk.java.net/~serb/AAShapePipeBenchmark.java


Thanks,
I will try it asap

Laurent








Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-10 Thread Sergey Bylokhov

On 4/10/13 11:46 PM, Laurent Bourgès wrote:

I see that some methods which take them as arguments don't use them. And
most of the time we pass AATileGenerator and abox[] to the same methods, so
they could be merged?


For now I did not want to modify the AAShapePipe signatures: abox[] is
filled by AATileGenerator implementations (ductus, pisces, others) in order
to have the shape bounds and render only tiles covering this area.

You still have to check all the places where these objects are filled 
and used, and refactoring is a good start, no?
Otherwise, how can you prove that these arrays are used as you would 
expect? These arrays could be stored, like the cache, or re-used for other 
purposes (if someone doesn't want to create new arrays).

It would probably be good to split all your changes into a few CRs:
 - cleanup
 - some small changes which give us most of the speedup
 - all other things.
??








--
Best regards, Sergey.



Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-10 Thread Jim Graham
I'm pretty familiar with all of this code and there aren't any places 
that save the tile array that I remember.  The embedded code that Pisces 
was taken from had some caching of alpha arrays, but we didn't use or 
keep that when we converted it for use in the JDK...


It occurs to me that since you are collecting the various pieces of 
information into an object to store in the thread local storage, perhaps 
we should convert to a paradigm where an entire Tile Generation sequence 
uses that object (TileState?) as its main way to communicate info around 
the various stages.  Thus, you don't really need an int[4] to store the 
4 parameters; they could be stored directly in the TileState object. 
This would require more sweeping changes to the pipeline, but it might 
make the code a bit more readable (and make any hit from converting over 
more moot, as the changes would improve readability and give more focus 
to the relationships between all of the various bits of data).  Then it 
simply becomes a matter of managing the lifetime and allocation of the 
TileState objects, which is a minor update to the newly refactored code.
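
As a rough picture of what such an object could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the thread only proposes the name TileState, so the fields, the setBounds/alphaTile methods and the growth policy are hypothetical, not the design that was eventually adopted.

```java
// Hypothetical sketch: field and method names are assumptions, not JDK code.
final class TileState {
    // Shape bounds kept as plain fields instead of a separate int[4].
    int x0, y0, x1, y1;

    // Reusable alpha tile buffer, grown only when a larger tile is requested.
    private byte[] alpha = new byte[0];

    void setBounds(int x0, int y0, int x1, int y1) {
        this.x0 = x0;
        this.y0 = y0;
        this.x1 = x1;
        this.y1 = y1;
    }

    int width()  { return x1 - x0; }
    int height() { return y1 - y0; }

    /** Returns a buffer of at least len bytes, reusing the previous one if possible. */
    byte[] alphaTile(int len) {
        if (alpha.length < len) {
            alpha = new byte[len];
        }
        return alpha;
    }
}
```

With a shape like this, the stages of the pipeline would pass a single TileState around rather than a generator plus a separate abox[] array.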


...jim








Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-09 Thread Jim Graham

Hi Laurent,

Quick questions - which benchmarks were run before/after?  I see a lot of 
benchmark running in your Pisces improvement thread, but none here.

Also, this should be tested on multiple platforms, preferably Linux, Windows 
and Mac to see how it is affected by differences in the platform runtimes and 
threading (hopefully minimal).

Finally, Hotspot is supposed to deal very well with small thread-local 
allocations like the int[4] and Rectangle2D that you optimized.  Was it 
necessary to cache those at all?  I'm sure the statistics for the allocations 
show up in a memory profile, but that doesn't mean it is costing us anything - 
ideally such small allocations are as fast as free and having to deal with 
caching them in a context will actually lose performance.  It may be that the 
tile caching saved enough that it might have masked unnecessary or detrimental 
changes for the smaller objects...

...jim



Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-09 Thread Laurent Bourgès
Dear Jim,

I admit I only looked at the NetBeans memory profiler's output: no more
megabytes allocated !

The main question is: how can I know how GC / hotspot deals with such small
allocations ? Is there any JVM flag that shows the real allocations, as
jmap -histo does ?


 Quick questions - which benchmarks were run before/after?  I see a lot of
 benchmark running in your Pisces improvement thread, but none here.


Agreed; I can try running j2dBench on this fix only. I generally run
Andrea's MapBench as it appeared more complex and uses multiple threads.


 Also, this should be tested on multiple platforms, preferably Linux,
 Windows and Mac to see how it is affected by differences in the platform
 runtimes and threading (hopefully minimal).


This is more difficult for me: at work I can use a Mac (OS X 10.8) and I can
run Windows XP within VirtualBox (but it is not very representative).

Don't you have any test platform at Oracle to perform such tests /
benchmarks ?


 Finally, Hotspot is supposed to deal very well with small thread-local
 allocations like the int[4] and Rectangle2D that you optimized.  Was it
 necessary to cache those at all?  I'm sure the statistics for the
 allocations show up in a memory profile, but that doesn't mean it is
 costing us anything - ideally such small allocations are as fast as free
 and having to deal with caching them in a context will actually lose
 performance.  It may be that the tile caching saved enough that it might
 have masked unnecessary or detrimental changes for the smaller objects...


I repeat my question: how can I know at runtime how hotspot optimizes
AAShapePipe code (allocations ...) ? Can hotspot do stack allocation ?
Is it explained somewhere (allocation size threshold) ?

Maybe -verbose:gc output would help ?
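
One way to observe GC activity from inside the JVM, in addition to the -verbose:gc flag, is the standard java.lang.management API. The sketch below is only an illustration under stated assumptions: the class name GcProbe and the allocation loop are made up for this example, and the counters show how often and for how long collectors ran, not whether a given allocation was optimized away.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical probe for illustration; GcProbe is not an existing class.
class GcProbe {
    /** Total number of collections over all collectors (undefined counts, reported as -1, are skipped). */
    static long totalCollections() {
        long n = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount();
            if (c > 0) n += c;
        }
        return n;
    }

    /** Approximate accumulated collection time in milliseconds over all collectors. */
    static long totalCollectionMillis() {
        long ms = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) ms += t;
        }
        return ms;
    }

    public static void main(String[] args) {
        long c0 = totalCollections();
        long t0 = totalCollectionMillis();
        long sink = 0;
        // Many tiny, short-lived allocations, similar in size to an int[4] bbox.
        for (int i = 0; i < 50_000_000; i++) {
            int[] box = new int[4];
            box[0] = i;
            sink += box[0];
        }
        System.out.println("collections: " + (totalCollections() - c0)
                + ", gc time (ms): " + (totalCollectionMillis() - t0)
                + ", sink: " + sink);
    }
}
```

Comparing the deltas with and without a change gives a coarse view of how much work the collector actually did during a run.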

Finally I spent a lot of time on pisces renderer and running MapBench to
show performance gains.

Thanks for your interesting feedback,

Laurent




Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste

2013-04-09 Thread Jim Graham

Hi Laurent,

The allocations will always show up on a heap profiler, I don't know of any way 
of having them not show up if they are stack allocated, but I don't think that 
stack allocation is the issue here - small allocations come out of a fast 
generation that costs almost nothing to allocate from and nearly nothing to 
clean up.  They are actually getting allocated and GC'd, but the process is 
optimized.

The only way to tell is to benchmark and see which changes make a difference 
and which are in the noise (or, in some odd counter-intuitive cases, 
counter-productive)...
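
As a very rough illustration of the comparison described above, the sketch below times a loop that allocates a fresh int[4] per iteration against one that reuses a single array. It is an assumption-laden toy: the class name AllocVsReuse and the workload are invented, and hand-written timing loops like this are easily distorted by JIT warm-up and other optimizations, so a harness such as jmh (suggested elsewhere in this thread) is the more trustworthy tool.

```java
// Hypothetical toy comparison; not a substitute for a real benchmark harness.
class AllocVsReuse {
    /** Allocates a new 4-int box on every iteration. */
    static long fresh(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            int[] box = new int[4];
            fill(box, i);
            sum += area(box);
        }
        return sum;
    }

    /** Reuses one 4-int box for all iterations. */
    static long reused(int n) {
        long sum = 0;
        int[] box = new int[4];
        for (int i = 0; i < n; i++) {
            fill(box, i);
            sum += area(box);
        }
        return sum;
    }

    private static void fill(int[] b, int i) {
        b[0] = i;
        b[1] = i;
        b[2] = i + 3;
        b[3] = i + 2;
    }

    private static long area(int[] b) {
        return (long) (b[2] - b[0]) * (b[3] - b[1]);
    }

    public static void main(String[] args) {
        int n = 5_000_000;
        // Warm up both variants so the JIT has a chance to compile them.
        for (int i = 0; i < 5; i++) {
            fresh(n);
            reused(n);
        }
        long t0 = System.nanoTime();
        long a = fresh(n);
        long t1 = System.nanoTime();
        long b = reused(n);
        long t2 = System.nanoTime();
        System.out.println("fresh:  " + (t1 - t0) / 1_000_000 + " ms (sum " + a + ")");
        System.out.println("reused: " + (t2 - t1) / 1_000_000 + " ms (sum " + b + ")");
    }
}
```

If the two timings are within noise of each other, that is a hint that caching such small arrays buys little on that particular JVM and machine.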

...jim





AAShapePipe concurrency memory waste

2013-04-05 Thread Laurent Bourgès
Dear java2d members,

I found some problems in java2d.pipe.AAShapePipe related to both
concurrency & memory usage:
- concurrency issue related to the static theTile field: only 1 tile is
cached, so a new byte[] is created for other threads at each call to
renderTile()
- excessive memory usage (byte[] for tile, int[] and rectangle): at each
call to renderPath / renderTiles, several small objects are created (never
cached), which leads to hundreds of megabytes that GC must deal with
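
The single-slot behaviour described in the first bullet can be sketched as follows. This is an illustration of the pattern as described in this message, not a copy of the JDK source; the class and method names (SingleTileCache, getTile, dropTile) are hypothetical.

```java
// Hypothetical illustration of a one-slot static tile cache.
class SingleTileCache {
    // The only cached tile; null while some caller is using it.
    private static byte[] theTile;

    /** Takes the cached tile if it is free and large enough, else allocates a new one. */
    static synchronized byte[] getTile(int len) {
        byte[] t = theTile;
        if (t == null || t.length < len) {
            t = new byte[len];   // slot empty (or too small): allocate
        } else {
            theTile = null;      // slot taken by this caller
        }
        return t;
    }

    /** Puts a tile back into the single slot, replacing whatever was there. */
    static synchronized void dropTile(byte[] t) {
        theTile = t;
    }
}
```

While one thread holds the cached tile, every other thread calling getTile() finds the slot empty and allocates its own byte[], which is the per-call allocation seen with several rendering threads.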

Here are profiling screenshots:
- 4 threads drawing on their own buffered image (MapBench test):
http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_byte_tile.png

- excessive int[] / Rectangle creation:
http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_int_bbox.png
http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_rectangle_bbox.png

Here is the proposed patch:
http://jmmc.fr/~bourgesl/share/AAShapePipe/webrev-1/

I applied a simple solution: use a ThreadLocal or ConcurrentLinkedQueue
(see useThreadLocal flag) to cache one AAShapePipeContext per thread (2K
max).
As its memory footprint is very small, I recommend using ThreadLocal.
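
A minimal sketch of the ThreadLocal variant is shown below. It is an assumption about the general shape only: the names ShapeContext and ContextCache and the field layout are hypothetical and do not reproduce the AAShapePipeContext class from the patch.

```java
// Hypothetical per-thread context holding the reusable small objects.
class ShapeContext {
    final int[] abox = new int[4];        // reusable bounding box
    byte[] tile = new byte[32 * 32];      // reusable alpha tile (size is an assumption)
}

// Hypothetical cache: one ShapeContext per thread, created lazily.
class ContextCache {
    private static final ThreadLocal<ShapeContext> CTX = new ThreadLocal<ShapeContext>() {
        @Override
        protected ShapeContext initialValue() {
            return new ShapeContext();
        }
    };

    /** Returns the calling thread's context; no locking is needed. */
    static ShapeContext get() {
        return CTX.get();
    }
}
```

Each rendering thread calls ContextCache.get() at the start of a render call and reuses the arrays inside, so threads never contend for a shared tile.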

Is it necessary to use Soft/Weak references to avoid excessive memory usage
for such a cache ?

Is there any class dedicated to such a cache (ThreadLocal with cache eviction
or ConcurrentLinkedQueue using WeakReference ?) ?
I think it could be very useful at the JDK level to have such a feature (i.e.
a generic GC-friendly cache)
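
For reference, one possible shape of a GC-friendly variant is a ThreadLocal that holds its buffer through a SoftReference, so the collector may reclaim the buffer under memory pressure. The sketch below is only an illustration of that idea; the class name SoftTileCache is hypothetical and no such class is claimed to exist in the JDK.

```java
import java.lang.ref.SoftReference;

// Hypothetical sketch of a per-thread, softly referenced tile buffer.
class SoftTileCache {
    private static final ThreadLocal<SoftReference<byte[]>> TILE = new ThreadLocal<>();

    /** Returns this thread's buffer of at least len bytes, recreating it if it was reclaimed. */
    static byte[] tile(int len) {
        SoftReference<byte[]> ref = TILE.get();
        byte[] t = (ref != null) ? ref.get() : null;
        if (t == null || t.length < len) {
            t = new byte[len];
            TILE.set(new SoftReference<>(t));
        }
        return t;
    }
}
```

The trade-off is an extra indirection and null check on every access in exchange for letting the buffer be freed when memory is tight.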

Regards,
Laurent