Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste
Jim and Sergey,

1/ Here are a few benchmarks (based on mapBench again) running several modified versions of AAShapePipe:
http://jmmc.fr/~bourgesl/share/AAShapePipe/mapBench/

Times in ms, 20 loops per thread; each version reports two sets of three timings, in the order printed:

  version                              1 thread  2 threads  4 threads | 1 thread  2 threads  4 threads
  ref                                      3742       4756       8528 |    56264      75566     141546
  int4                                     3653       4684       8291 |    55950      74796     139924
  byte[]                                   3795       4605       8246 |    54961      72768     139430
  int4 / byte[] / rectangle
  cached in TileState                      3610       4481       8225 |    54651      74516     140153

So you may be right: the results vary depending on the optimizations (int4, byte[] or all)!
Maybe I should test the different versions on the optimized Pisces renderer...

Here is an updated patch:
http://jmmc.fr/~bourgesl/share/AAShapePipe/webrev-2/

2/ Thanks for your comments: actually a refactoring is possible to use a (shared) TileState instance replacing the int[] bbox and the Rectangle bbox:

- RenderingEngine.AATileGenerator getAATileGenerator(... int[] abox)
  It is very interesting here to propose an extensible tile state: maybe created by the renderer engine to cache other data?

- Rectangle and Rectangle2D are only used as the shape s and device rectangle given to CompositePipe.startSequence():
  public Object startSequence(SunGraphics2D sg, Shape s, Rectangle dev, int[] abox);

Changing this interface may become difficult:

  AlphaColorPipe.java: 41: public Object startSequence(SunGraphics2D sg, Shape s, Rectangle dev,
    OK, [s, dev, abox] unused

  AlphaPaintPipe.java 81: public Object startSequence(SunGraphics2D sg, Shape s, Rectangle devR,
    creates a paint context:
    PaintContext paintContext = sg.paint.createContext(sg.getDeviceColorModel(), devR, s.getBounds2D(), sg.cloneTransform(), sg.getRenderingHints());

  GeneralCompositePipe.java: 62: public Object startSequence(SunGraphics2D sg, Shape s, Rectangle devR,
    abox unused and creates a paint context:
    PaintContext paintContext = sg.paint.createContext(model, devR, s.getBounds2D(), sg.cloneTransform(), hints);

  SpanClipRenderer.java 68: public Object startSequence(SunGraphics2D sg, Shape s, Rectangle devR,
    forwards to another composite pipe:
    return new SCRcontext(ri, outpipe.startSequence(sg, s, devR, abox));

It could be possible to use TileState in the PaintContext interface and fix the implementations, but it may become a tricky change (API change).

What do you think?
Laurent
Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste
Last idea: I will enhance Andrea's mapBench benchmark to report statistics per thread: number of loops, avg, min, max, stddev.

I guess that the total bench time is not so representative, as the thread pool can distribute the workload differently at each test; per-thread statistics will help to get better timings and better comparisons between bench runs.

Regards,
Laurent
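The per-thread statistics mentioned in this message (number of loops, avg, min, max, stddev) could be gathered along the following lines. This is only a sketch: the class ThreadStats and its methods are hypothetical names and are not part of MapBench; one instance per benchmark thread is assumed, so no synchronization is shown.

```java
/** Hypothetical per-thread timing accumulator; not part of MapBench. */
final class ThreadStats {
    private long count;
    private double sum;
    private double sumSq;
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    /** Records the duration of one loop, in milliseconds. */
    void add(double ms) {
        count++;
        sum += ms;
        sumSq += ms * ms;
        if (ms < min) min = ms;
        if (ms > max) max = ms;
    }

    long count() { return count; }
    double min() { return min; }
    double max() { return max; }
    double avg() { return count == 0 ? 0.0 : sum / count; }

    /** Population standard deviation of the recorded durations. */
    double stddev() {
        if (count == 0) return 0.0;
        double mean = avg();
        double variance = sumSq / count - mean * mean;
        // Guard against tiny negative values caused by rounding.
        return Math.sqrt(Math.max(0.0, variance));
    }
}
```

Each worker thread would call add() once per loop and print its own figures at the end, so that two runs can be compared per thread rather than only by total time.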
Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste
Hi Laurent,

Yes, these kinds of minor optimizations (i.e. optimizations that don't make a clear 2x type of savings) can be frustrating at times. It looks like there is potential for a decent return there if we can find the right change. Sometimes rearranging a couple of things that don't look like they are saving work can somehow trip the runtime into executing the code more efficiently.

I skimmed through your thoughts at the bottom. It occurred to me after I sent that idea out that sometimes we use int[] because we have to hand the values to native for return values and there is no easy way to return 4 values from a native method. An array is simplest because it can be loaded with answers via a single JNI call. 4 fields in a class would require 4xJNI.SetField calls. It might have better payoff if we can cache renderer state there as well, which gets into subclassing.

Also, doing this right may have to be done by someone here at Oracle because it may involve modifying the Ductus pipeline to match (it's been a while and I don't remember if we open sourced the code that interfaces Ductus to the RenderingEngine interfaces...?)
...jim
Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste
Andrea,

I am running benchmarks on my laptop (i7, 2 cores at 2.8 GHz + HT = 4 virtual CPUs) on Linux 64 (Fedora 14).
Note: I always use cpufrequtils to set the CPU governor to performance and use a fixed frequency of 2.8 GHz:

    [bourgesl@jmmc-laurent ~]$ uname -a
    Linux jmmc-laurent.obs.ujf-grenoble.fr 2.6.35.14-106.fc14.x86_64 #1 SMP Wed Nov 23 13:07:52 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

2013/4/10 Andrea Aime andrea.a...@geo-solutions.it

On Tue, Apr 9, 2013 at 7:34 PM, Laurent Bourgès bourges.laur...@gmail.com wrote:

Also, this should be tested on multiple platforms, preferably Linux, Windows and Mac to see how it is affected by differences in the platform runtimes and threading (hopefully minimal).

It appears more difficult for me: at work I can use a Mac (10.8) and I can run Windows XP within VirtualBox (but it is not very representative).

I believe I can run MapBench on my Linux 64bit box during the next weekend; that would add a platform, and one where the server side behavior is enabled by default. And hopefully run the other benchmark as well.

I also run j2DBench, but I can also try Java2D.demos to perform regression tests.

Laurent, have you made any changes to MapBench since I've sent it to you?

Yes, I fixed it a bit (cached BasicStroke, reused BufferedImage / Graphics) and added an explicit GC before tests (same initial conditions):
http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench/

Look at MapBench-src.zip (http://jmmc.fr/%7Ebourgesl/share/java2d-pisces/MapBench/MapBench-src.zip) for the test changes.

Thanks for your efforts,
Laurent
Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste
Dear Jim,

2013/4/9 Jim Graham james.gra...@oracle.com

The allocations will always show up on a heap profiler, I don't know of any way of having them not show up if they are stack allocated, but I don't think that stack allocation is the issue here - small allocations come out of a fast generation that costs almost nothing to allocate from and nearly nothing to clean up. They are actually getting allocated and GC'd, but the process is optimized.

The only way to tell is to benchmark and see which changes make a difference and which are in the noise (or, in some odd counter-intuitive cases, counter-productive)...

...jim

I admit I like GC because in Java it avoids dealing with pointers as C/C++ does; however, I prefer that the GC cleans real garbage (application...) rather than wasted memory: I prefer not to count on the GC when I can avoid wasting memory that gives the GC more work = reduce useless garbage (save the planet)!

Moreover, GC and / or thread-local allocation (TLAB) seems to have more overhead than you think ("fast generation that costs almost nothing to allocate from and nearly nothing to clean up").

Here are my micro-benchmark results related to int[4] allocation, where I mimic the AAShapePipe.fillParallelogram() method:

    Patch    Ref      Gain
    5,96     8,27     138,76%
    7,31     14,96    204,65%
    10,65    20,4     191,55%
    15,44    29,83    193,20%

Test environment: Linux64 with OpenJDK8 (2 real cpu cores, 4 virtual cpus)
JVM settings: -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -Xms128m -Xmx128m

Benchmark code (using Peter Levart's microbench classes):
http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/

My conclusion is: nothing is free (allocation + cleanup) and it is very noticeable in multi-threading tests.
I admit that I use a dirty int[4] array (no cleanup) but it is not necessary: maybe the performance gain comes from that.

Finally, here is the output with the -XX:+PrintTLAB flag:

    TLAB: gc thread: 0x7f105813d000 [id: 4053] desired_size: 1312KB slow allocs: 0 refill waste: 20992B alloc: 1,065600KB refills: 20 waste 1,2% gc: 323712B slow: 600B fast: 0B
    TLAB: gc thread: 0x7f105813a800 [id: 4052] desired_size: 1312KB slow allocs: 0 refill waste: 20992B alloc: 1,065600KB refills: 7 waste 7,9% gc: 745568B slow: 176B fast: 0B
    TLAB: gc thread: 0x7f1058138800 [id: 4051] desired_size: 1312KB slow allocs: 0 refill waste: 20992B alloc: 1,065600KB refills: 15 waste 3,1% gc: 618464B slow: 448B fast: 0B
    TLAB: gc thread: 0x7f1058136800 [id: 4050] desired_size: 1312KB slow allocs: 0 refill waste: 20992B alloc: 1,065600KB refills: 7 waste 0,0% gc: 0B slow: 232B fast: 0B
    TLAB: gc thread: 0x7f1058009000 [id: 4037] desired_size: 1312KB slow allocs: 0 refill waste: 20992B alloc: 1,065600KB refills: 1 waste 27,5% gc: 369088B slow: 0B fast: 0B
    TLAB totals: thrds: 5 refills: 50 max: 20 slow allocs: 0 max 0 waste: 3,1% gc: 2056832B max: 745568B slow: 1456B max: 600B fast: 0B max: 0B

I would have expected that TLAB can recycle all useless int[4] arrays as fast as possible = waste = 100% ???

*Is there any bug in TLAB (core-libs)? Should I report this issue to the hotspot team?*

*Test using ThreadLocal AAShapePipeContext:*

    {
        AAShapePipeContext ctx = getThreadContext();
        int abox[] = ctx.abox;

        // use array:
        // mimic: AATileGenerator aatg = renderengine.getAATileGenerator(x, y, dx1, dy1, dx2, dy2, 0, 0, clip, abox);
        abox[0] = 7;
        abox[1] = 11;
        abox[2] = 13;
        abox[3] = 17;

        // mimic: renderTiles(sg, computeBBox(ux1, uy1, ux2, uy2), aatg, abox);
        devNull1.yield(abox);

        if (!useThreadLocal) {
            restoreContext(ctx);
        }
    }

-XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728 -XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24]
#-
# ContextGetInt4: run duration: 10 000 ms
#
# Warm up:
# 4 threads, Tavg = 13,84 ns/op (σ = 0,23 ns/op), Total ops = 2889056179 [13,93 (717199825), 13,87 (720665624), 13,48 (741390545), 14,09 (709800185)]
# 4 threads, Tavg = 14,25 ns/op (σ = 0,57 ns/op), Total ops = 2811615084 [15,21 (658351236), 14,18 (706254551), 13,94 (718202949), 13,74 (728806348)]
cleanup (explicit Full GC) ... cleanup done.
# Measure:
1 threads, Tavg = 5,96 ns/op (σ = 0,00 ns/op), Total ops = 1678357614 [ 5,96 (1678357614)]
2 threads, Tavg = 7,33 ns/op (σ = 0,03 ns/op), Total ops = 2729723450 [ 7,31 (1369694121), 7,36 (1360029329)]
3 threads, Tavg = 10,65 ns/op (σ = 2,73 ns/op), Total ops = 2817154340 [13,24 (755190111), 13,23 (755920429), 7,66 (1306043800)]
4 threads, Tavg = 15,44 ns/op (σ = 3,33 ns/op), Total ops = 2589897733 [17,05 (586353618), 19,23
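The test snippet in this message calls getThreadContext() without showing it. One possible shape for that helper is sketched below. This is a guess for illustration, not the code from the webrev: only the abox field is taken from the snippet, while the ContextHolder class and the use of ThreadLocal.withInitial are assumptions.

```java
/** Stand-in for the per-thread context; only the abox field comes from the test snippet. */
final class AAShapePipeContext {
    // Reused on every call instead of allocating a new int[4].
    final int[] abox = new int[4];
}

/** Hypothetical holder class; the actual patch may manage contexts differently. */
final class ContextHolder {
    private static final ThreadLocal<AAShapePipeContext> CTX =
            ThreadLocal.withInitial(AAShapePipeContext::new);

    private ContextHolder() {
    }

    /** Returns the calling thread's context, creating it on first use. */
    static AAShapePipeContext getThreadContext() {
        return CTX.get();
    }
}
```

With this shape each thread pays for one allocation on first use and one ThreadLocal lookup per call afterwards, which is the trade-off the micro-benchmark is trying to measure against a fresh int[4] per call.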
Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste
Hi Laurent,

Could you disable tiered compilation for performance tests? Tiered compilation is usually a source of jitter in the results. Pass -XX:-TieredCompilation to the VM.

Regards, Peter
Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste
Peter,

1/ I modified your TestRunner class to print total ops and perform an explicit GC before runTests:
http://jmmc.fr/~bourgesl/share/AAShapePipe/microbench/

2/ I applied your advice but it does not change much:

-XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728 -XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24]
#-
# ContextGetInt4: run duration: 10 000 ms
#
# Warm up:
# 4 threads, Tavg = 13,84 ns/op (σ = 0,23 ns/op), Total ops = 2889056179 [13,93 (717199825), 13,87 (720665624), 13,48 (741390545), 14,09 (709800185)]
# 4 threads, Tavg = 14,25 ns/op (σ = 0,57 ns/op), Total ops = 2811615084 [15,21 (658351236), 14,18 (706254551), 13,94 (718202949), 13,74 (728806348)]
cleanup (explicit Full GC) ... cleanup done.
# Measure:
1 threads, Tavg = 5,96 ns/op (σ = 0,00 ns/op), Total ops = 1678357614 [ 5,96 (1678357614)]
2 threads, Tavg = 7,33 ns/op (σ = 0,03 ns/op), Total ops = 2729723450 [ 7,31 (1369694121), 7,36 (1360029329)]
3 threads, Tavg = 10,65 ns/op (σ = 2,73 ns/op), Total ops = 2817154340 [13,24 (755190111), 13,23 (755920429), 7,66 (1306043800)]
4 threads, Tavg = 15,44 ns/op (σ = 3,33 ns/op), Total ops = 2589897733 [17,05 (586353618), 19,23 (519345153), 17,88 (559401974), 10,81 (924796988)]

-XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728 -XX:MaxHeapSize=134217728 -XX:+PrintCommandLineFlags -XX:-PrintFlagsFinal -XX:-TieredCompilation -XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
JVM START: 1.8.0-internal [OpenJDK 64-Bit Server VM 25.0-b24]
#-
# GetInt4: run duration: 10 000 ms
#
# Warm up:
# 4 threads, Tavg = 31,56 ns/op (σ = 0,43 ns/op), Total ops = 1267295706 [31,30 (319512554), 31,02 (32229), 32,12 (311334550), 31,82 (314155269)]
# 4 threads, Tavg = 30,75 ns/op (σ = 1,81 ns/op), Total ops = 1302123211 [32,21 (310949394), 32,37 (309275124), 27,87 (359125007), 31,01 (322773686)]
cleanup (explicit Full GC) ... cleanup done.
# Measure:
1 threads, Tavg = 8,36 ns/op (σ = 0,00 ns/op), Total ops = 1196238323 [ 8,36 (1196238323)]
2 threads, Tavg = 14,95 ns/op (σ = 0,04 ns/op), Total ops = 1337648720 [15,00 (666813210), 14,91 (670835510)]
3 threads, Tavg = 20,65 ns/op (σ = 0,99 ns/op), Total ops = 1453119707 [19,57 (511100480), 21,97 (455262170), 20,54 (486757057)]
4 threads, Tavg = 30,76 ns/op (σ = 0,54 ns/op), Total ops = 1301090278 [31,51 (317527231), 30,79 (324921525), 30,78 (325063322), 29,99 (333578200)]
# JVM END

3/ I tried several heap settings (without Xms/Xmx ...) but it has almost no effect.

*Should I play with the TLAB resize / initial size, or with a different GC collector (G1 ...)? Can anybody explain to me what the PrintTLAB output means?*

Laurent
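To compare the two "Measure" blocks in this message per thread count, the GetInt4 time can be divided by the ContextGetInt4 time. The sketch below does only that arithmetic; the class SpeedupTable and its methods are hypothetical names, and the ratios are indicative at best, since only the second printed flag list shows -XX:-TieredCompilation.

```java
import java.util.Locale;

/** Hypothetical helper for comparing two sets of ns/op figures; not part of the microbench classes. */
final class SpeedupTable {
    private SpeedupTable() {
    }

    /** Ratio of reference time to patched time; values above 1 favour the patch. */
    static double ratio(double refNsPerOp, double patchNsPerOp) {
        return refNsPerOp / patchNsPerOp;
    }

    /** One line per entry, assuming entry i was measured with i + 1 threads. */
    static String format(double[] refNsPerOp, double[] patchNsPerOp) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < refNsPerOp.length; i++) {
            sb.append(String.format(Locale.ROOT, "%d thread(s): %.2fx%n",
                    i + 1, ratio(refNsPerOp[i], patchNsPerOp[i])));
        }
        return sb.toString();
    }
}
```

For example, SpeedupTable.format(new double[] {8.36, 14.95, 20.65, 30.76}, new double[] {5.96, 7.33, 10.65, 15.44}) uses the Tavg values reported for GetInt4 and ContextGetInt4 respectively.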
Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste
On Wed, Apr 10, 2013 at 10:14 AM, Laurent Bourgès bourges.laur...@gmail.com wrote:

Andrea,
I am running benchmarks on my laptop (i7 - 2 core 2.8Ghz + HT = 4 virtual cpus) on linux 64 (fedora 14).
Note: I always use cpufrequtils to set the cpu governor to performance and use fixed frequency = 2.8Ghz:

    [bourgesl@jmmc-laurent ~]$ uname -a
    Linux jmmc-laurent.obs.ujf-grenoble.fr 2.6.35.14-106.fc14.x86_64 #1 SMP Wed Nov 23 13:07:52 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

Yes, I did the same when I initially ran MapBench on JDK 7 vs OpenJDK 7 (governor settings wise).
Since you are already running on that platform, maybe I can try to cover Linux 32 bit instead; I also have a notebook with that setup.

Laurent, have you made any changes to MapBench since I've sent it to you?

Yes I fixed a bit (cached BasicStroke, reused BufferedImage / Graphics) and added explicit GC before tests (same initial conditions):
http://jmmc.fr/~bourgesl/share/java2d-pisces/MapBench/

Look at MapBench-src.zip (http://jmmc.fr/%7Ebourgesl/share/java2d-pisces/MapBench/MapBench-src.zip) for test changes.

Thanks

Cheers
Andrea

--
Ing. Andrea Aime
@geowolf
Technical Lead
GeoSolutions S.A.S.
Via Poggio alle Viti 1187
55054 Massarosa (LU)
Italy
phone: +39 0584 962313
fax: +39 0584 1660272
mob: +39 339 8844549
http://www.geo-solutions.it
http://twitter.com/geosolutions_it
Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste
Hi, Laurent.

I am not an expert here, but just my 50 cents.
This optimization should take place only if this really is a hotspot. But if it really is a hotspot, it would probably be better to remove these array/object allocations altogether and use plain bytes?
I see that some methods which take it as an argument don't use it. And most of the time we pass AATileGenerator and abox[] to the same methods, so could they be merged?

Also I suggest using jmh for Java microbenchmarks:
http://openjdk.java.net/projects/code-tools/jmh
So your test will be:
http://cr.openjdk.java.net/~serb/AAShapePipeBenchmark.java

--
Best regards, Sergey.
Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste
Sergey,

I am not an expert here, but just my 50 cents.
This optimization should take place only if this really is a hotspot. But if it really is a hotspot, it would probably be better to remove these array/object allocations altogether and use plain bytes?

Java2D calls AAShapePipe for each shape (line, rectangle...) rendered, so it is a hotspot for me for big drawings, as it depends on the drawing complexity (for example, Andrea's MapBench can produce maps having more than 100 000 shapes per image...).

I see that some methods which take it as an argument don't use it. And most of the time we pass AATileGenerator and abox[] to the same methods, so could they be merged?

For now I did not want to modify the AAShapePipe signatures: abox[] is filled by the AATileGenerator implementations (ductus, pisces, others) in order to have the shape bounds and render only the tiles covering this area.

Also I suggest using jmh for Java microbenchmarks:
http://openjdk.java.net/projects/code-tools/jmh
So your test will be:
http://cr.openjdk.java.net/%7Eserb/AAShapePipeBenchmark.java

Thanks, I will try it asap.
Laurent

--
Best regards, Sergey.
Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste
On 4/10/13 11:46 PM, Laurent Bourgès wrote:

>> I see that some methods which take them as arguments don't use them. And most of the time we pass AATileGenerator and abox[] to the same methods, so could they be merged?
>
> For now I did not want to modify the AAShapePipe signatures: abox[] is filled by the AATileGenerator implementations (ductus, pisces, others) to provide the shape bounds, so that only the tiles covering this area are rendered.

You still have to check all the places where these objects are filled and used, and refactoring is a good start, no? Otherwise, how can you prove that these arrays are used as you would expect? These arrays could be stored somewhere, like a cache, or reused for another purpose (if someone did not want to create new arrays).

It would probably be good to split your changes into a few CRs:
- cleanup
- the small changes that give us most of the speedup
- all other things
??

>> Also, I suggest using jmh for Java microbenchmarks:
>> http://openjdk.java.net/projects/code-tools/jmh
>> So your test will be:
>> http://cr.openjdk.java.net/~serb/AAShapePipeBenchmark.java
>
> Thanks, I will try it asap.
> Laurent

--
Best regards, Sergey.
Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste
I'm pretty familiar with all of this code, and as far as I remember there aren't any places that save the tile array. The embedded code that Pisces was taken from had some caching of alpha arrays, but we didn't use or keep that when we converted it for use in the JDK...

It occurs to me that since you are collecting the various pieces of information into an object to store in thread-local storage, perhaps we should convert to a paradigm where an entire tile generation sequence uses that object (TileState?) as its main way to communicate info between the various stages. Then you don't really need an int[4] to store the 4 parameters; they could be stored directly in the TileState object. This would require more sweeping changes to the pipeline, but it might make the code a bit more readable (and the cost of converting would matter less, since the changes would improve readability and bring the relationships between the various bits of data into focus). Then it simply becomes a matter of managing the lifetime and allocation of the TileState objects, which is a minor update to the newly refactored code.

...jim

On 4/10/13 3:59 PM, Sergey Bylokhov wrote:
> On 4/10/13 11:46 PM, Laurent Bourgès wrote:
>>> I see that some methods which take them as arguments don't use them. And most of the time we pass AATileGenerator and abox[] to the same methods, so could they be merged?
>>
>> For now I did not want to modify the AAShapePipe signatures: abox[] is filled by the AATileGenerator implementations (ductus, pisces, others) to provide the shape bounds, so that only the tiles covering this area are rendered.
>
> You still have to check all the places where these objects are filled and used, and refactoring is a good start, no? Otherwise, how can you prove that these arrays are used as you would expect? These arrays could be stored somewhere, like a cache, or reused for another purpose (if someone did not want to create new arrays).
>
> It would probably be good to split your changes into a few CRs:
> - cleanup
> - the small changes that give us most of the speedup
> - all other things
> ??
>
>>> Also, I suggest using jmh for Java microbenchmarks:
>>> http://openjdk.java.net/projects/code-tools/jmh
>>> So your test will be:
>>> http://cr.openjdk.java.net/~serb/AAShapePipeBenchmark.java
>>
>> Thanks, I will try it asap.
>> Laurent
>
> --
> Best regards, Sergey.
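To make the TileState suggestion in this message concrete, here is a minimal sketch in which the four bounds are plain fields of a state object that one stage fills and a later stage reads. Everything except the name TileState is an assumption made for illustration (field names, methods, the two stand-in stages); it is not code from the JDK or from the patch.

```java
/** Hypothetical sketch of a TileState; fields and methods are illustrative only. */
final class TileState {
    // Bounds stored directly as fields instead of in an int[4].
    int x0, y0, x1, y1;

    void setBounds(int x0, int y0, int x1, int y1) {
        this.x0 = x0;
        this.y0 = y0;
        this.x1 = x1;
        this.y1 = y1;
    }

    boolean isEmpty() {
        return x1 <= x0 || y1 <= y0;
    }
}

/** Two stand-in stages that communicate only through the shared TileState. */
final class TilePipelineSketch {

    // Stage 1 (stand-in for a tile generator): publishes the shape bounds.
    static void computeBounds(TileState ts, int x, int y, int w, int h) {
        ts.setBounds(x, y, x + w, y + h);
    }

    // Stage 2 (stand-in for the tile loop): reads the bounds from the same object.
    static int countTiles(TileState ts, int tileW, int tileH) {
        if (ts.isEmpty()) {
            return 0;
        }
        int cols = (ts.x1 - ts.x0 + tileW - 1) / tileW; // ceiling division
        int rows = (ts.y1 - ts.y0 + tileH - 1) / tileH;
        return cols * rows;
    }
}
```

With this shape, deciding whether a TileState is created per call, cached per thread, or pooled becomes a separate question from how the stages exchange data.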
Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste
Hi Laurent,

Quick questions - which benchmarks were run before/after? I see a lot of benchmark running in your Pisces improvement thread, but none here.

Also, this should be tested on multiple platforms, preferably Linux, Windows and Mac, to see how it is affected by differences in the platform runtimes and threading (hopefully minimal).

Finally, Hotspot is supposed to deal very well with small thread-local allocations like the int[4] and Rectangle2D that you optimized. Was it necessary to cache those at all? I'm sure the statistics for the allocations show up in a memory profile, but that doesn't mean it is costing us anything - ideally such small allocations are as fast as free, and having to deal with caching them in a context will actually lose performance. It may be that the tile caching saved enough that it might have masked unnecessary or detrimental changes for the smaller objects...

...jim

On 4/5/2013 5:20 AM, Laurent Bourgès wrote:
> Dear java2d members,
>
> I found some problems in java2d.pipe.AAShapePipe related to both concurrency and memory usage:
> - concurrency issue related to the static theTile field: only 1 tile is cached, so a new byte[] is created for the other threads at each call to renderTile()
> - excessive memory usage (byte[] for the tile, int[] and Rectangle): at each call to renderPath / renderTiles, several small objects are created (never cached), which adds up to hundreds of megabytes that the GC must deal with
>
> Here are profiling screenshots:
> - 4 threads drawing on their own buffered image (MapBench test):
>   http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_byte_tile.png
> - excessive int[] / Rectangle creation:
>   http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_int_bbox.png
>   http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_rectangle_bbox.png
>
> Here is the proposed patch:
> http://jmmc.fr/~bourgesl/share/AAShapePipe/webrev-1/
>
> I applied a simple solution: use a ThreadLocal or a ConcurrentLinkedQueue (see the useThreadLocal flag) to cache one AAShapePipeContext per thread (2K max). As its memory footprint is very small, I recommend using ThreadLocal.
>
> Is it necessary to use Soft/Weak references to avoid excessive memory usage for such a cache? Is there any class dedicated to such a cache (a ThreadLocal with cache eviction, or a ConcurrentLinkedQueue using WeakReference)? I think it would be very useful to have such a feature at the JDK level (i.e. a generic GC-friendly cache).
>
> Regards,
> Laurent
Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste
Dear Jim,

I admit I only looked at the NetBeans memory profiler's output: no more megabytes allocated! The main question is: how can I know how the GC / hotspot deals with such small allocations? Is there any JVM flag that shows the real allocations, as jmap -histo does?

> Quick questions - which benchmarks were run before/after? I see a lot of benchmark running in your Pisces improvement thread, but none here.

Agreed; I can try running j2dBench on this fix only. I generally run Andrea's MapBench as it appeared more complex and uses multiple threads.

> Also, this should be tested on multiple platforms, preferably Linux, Windows and Mac, to see how it is affected by differences in the platform runtimes and threading (hopefully minimal).

That is more difficult for me: at work I can use a Mac (10.8), and I can run Windows XP within VirtualBox (but it is not very representative). Don't you have a test platform at Oracle to perform such tests / benchmarks?

> Finally, Hotspot is supposed to deal very well with small thread-local allocations like the int[4] and Rectangle2D that you optimized. Was it necessary to cache those at all? I'm sure the statistics for the allocations show up in a memory profile, but that doesn't mean it is costing us anything - ideally such small allocations are as fast as free, and having to deal with caching them in a context will actually lose performance. It may be that the tile caching saved enough that it might have masked unnecessary or detrimental changes for the smaller objects...

I repeat my question: how can I know at runtime how hotspot optimizes the AAShapePipe code (allocations ...)? Can hotspot do stack allocation? Is it explained somewhere (allocation size threshold)? Maybe the -verbose:gc output would help?

Finally, I spent a lot of time on the pisces renderer and on running MapBench to show performance gains.

Thanks for your interesting feedback,
Laurent
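On the question of seeing real allocations at runtime: one option, besides a profiler, is the per-thread allocation counter that HotSpot-based JVMs expose through com.sun.management.ThreadMXBean. The following is a minimal sketch under that assumption; AllocationProbe is a hypothetical helper name, and the method returns -1 when the counter is not available. Allocations that the JIT removes entirely (for example through escape analysis) should not appear in this counter, which is why it can differ from what a heap profiler reports.

```java
import java.lang.management.ManagementFactory;

/** Hypothetical helper; assumes a HotSpot-based JVM for the com.sun.management extension. */
final class AllocationProbe {

    /** Returns the bytes allocated by the current thread while running work, or -1 if unsupported. */
    static long allocatedBytes(Runnable work) {
        java.lang.management.ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!(bean instanceof com.sun.management.ThreadMXBean)) {
            return -1L; // not a JVM that provides the extension
        }
        com.sun.management.ThreadMXBean hs = (com.sun.management.ThreadMXBean) bean;
        if (!hs.isThreadAllocatedMemorySupported() || !hs.isThreadAllocatedMemoryEnabled()) {
            return -1L;
        }
        long id = Thread.currentThread().getId();
        long before = hs.getThreadAllocatedBytes(id);
        work.run();
        return hs.getThreadAllocatedBytes(id) - before;
    }
}
```

For example, wrapping a rendering loop in `AllocationProbe.allocatedBytes(...)` before and after a change gives one number per variant to compare, which complements timing results rather than replacing them.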
Re: [OpenJDK 2D-Dev] AAShapePipe concurrency memory waste
Hi Laurent,

The allocations will always show up on a heap profiler. I don't know of any way of having them not show up if they are stack allocated, but I don't think that stack allocation is the issue here - small allocations come out of a fast generation that costs almost nothing to allocate from and nearly nothing to clean up. They are actually getting allocated and GC'd, but the process is optimized. The only way to tell is to benchmark and see which changes make a difference and which are in the noise (or, in some odd counter-intuitive cases, counter-productive)...

...jim

On 4/9/2013 10:34 AM, Laurent Bourgès wrote:
> Dear Jim,
>
> I admit I only looked at the NetBeans memory profiler's output: no more megabytes allocated! The main question is: how can I know how the GC / hotspot deals with such small allocations? Is there any JVM flag that shows the real allocations, as jmap -histo does?
>
>> Quick questions - which benchmarks were run before/after? I see a lot of benchmark running in your Pisces improvement thread, but none here.
>
> Agreed; I can try running j2dBench on this fix only. I generally run Andrea's MapBench as it appeared more complex and uses multiple threads.
>
>> Also, this should be tested on multiple platforms, preferably Linux, Windows and Mac, to see how it is affected by differences in the platform runtimes and threading (hopefully minimal).
>
> That is more difficult for me: at work I can use a Mac (10.8), and I can run Windows XP within VirtualBox (but it is not very representative). Don't you have a test platform at Oracle to perform such tests / benchmarks?
>
>> Finally, Hotspot is supposed to deal very well with small thread-local allocations like the int[4] and Rectangle2D that you optimized. Was it necessary to cache those at all? I'm sure the statistics for the allocations show up in a memory profile, but that doesn't mean it is costing us anything - ideally such small allocations are as fast as free, and having to deal with caching them in a context will actually lose performance. It may be that the tile caching saved enough that it might have masked unnecessary or detrimental changes for the smaller objects...
>
> I repeat my question: how can I know at runtime how hotspot optimizes the AAShapePipe code (allocations ...)? Can hotspot do stack allocation? Is it explained somewhere (allocation size threshold)? Maybe the -verbose:gc output would help?
>
> Finally, I spent a lot of time on the pisces renderer and on running MapBench to show performance gains.
>
> Thanks for your interesting feedback,
> Laurent
AAShapePipe concurrency memory waste
Dear java2d members,

I found some problems in java2d.pipe.AAShapePipe related to both concurrency and memory usage:
- concurrency issue related to the static theTile field: only 1 tile is cached, so a new byte[] is created for the other threads at each call to renderTile()
- excessive memory usage (byte[] for the tile, int[] and Rectangle): at each call to renderPath / renderTiles, several small objects are created (never cached), which adds up to hundreds of megabytes that the GC must deal with

Here are profiling screenshots:
- 4 threads drawing on their own buffered image (MapBench test):
  http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_byte_tile.png
- excessive int[] / Rectangle creation:
  http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_int_bbox.png
  http://jmmc.fr/~bourgesl/share/AAShapePipe/AAShapePipe_rectangle_bbox.png

Here is the proposed patch:
http://jmmc.fr/~bourgesl/share/AAShapePipe/webrev-1/

I applied a simple solution: use a ThreadLocal or a ConcurrentLinkedQueue (see the useThreadLocal flag) to cache one AAShapePipeContext per thread (2K max). As its memory footprint is very small, I recommend using ThreadLocal.

Is it necessary to use Soft/Weak references to avoid excessive memory usage for such a cache? Is there any class dedicated to such a cache (a ThreadLocal with cache eviction, or a ConcurrentLinkedQueue using WeakReference)? I think it would be very useful to have such a feature at the JDK level (i.e. a generic GC-friendly cache).

Regards,
Laurent
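For readers who want to picture the ThreadLocal approach described in this message, here is a minimal sketch that keeps one context per thread behind a SoftReference, as one possible answer to the eviction question. PipeContext and PipeContextCache are hypothetical names; the fields are assumptions and do not reproduce the AAShapePipeContext from the webrev.

```java
import java.lang.ref.SoftReference;

/** Hypothetical per-thread scratch data; not the AAShapePipeContext from the patch. */
final class PipeContext {
    // Reusable bounds array (assumed to hold 4 values).
    final int[] abox = new int[4];
    // Reusable tile buffer, grown on demand.
    private byte[] tile = new byte[0];

    /** Returns a buffer of at least len bytes, reallocating only when the cached one is too small. */
    byte[] tile(int len) {
        if (tile.length < len) {
            tile = new byte[len];
        }
        return tile;
    }
}

/** One context per thread; the SoftReference lets the GC reclaim it under memory pressure. */
final class PipeContextCache {
    private static final ThreadLocal<SoftReference<PipeContext>> CACHE = new ThreadLocal<>();

    static PipeContext get() {
        SoftReference<PipeContext> ref = CACHE.get();
        PipeContext ctx = (ref != null) ? ref.get() : null;
        if (ctx == null) {
            // First use on this thread, or the previous context was reclaimed.
            ctx = new PipeContext();
            CACHE.set(new SoftReference<>(ctx));
        }
        return ctx;
    }
}
```

Because each thread only ever sees its own context, no synchronization is needed on the buffers themselves; the trade-off is one context per live thread until the thread dies or the soft reference is cleared.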