I updated both the patched pisces code and the benchmarks:
http://jmmc.fr/~bourgesl/share/java2d-pisces/

A few results comparing ThreadLocal vs ConcurrentLinkedQueue usage:

OpenJDK 8 PATCH ThreadLocal mode:
Testing file /home/bourgesl/libs/openjdk/mapbench/test/dc_boulder_2013-13-30-06-13-17.ser
1 threads and 20 loops per thread, time: 2671 ms
2 threads and 20 loops per thread, time: 3239 ms
4 threads and 20 loops per thread, time: 6043 ms

OpenJDK 8 PATCH ConcurrentLinkedQueue mode:
Testing file /home/bourgesl/libs/openjdk/mapbench/test/dc_boulder_2013-13-30-06-13-17.ser
1 threads and 20 loops per thread, time: 2779 ms
2 threads and 20 loops per thread, time: 3416 ms
4 threads and 20 loops per thread, time: 6153 ms

Oracle JDK8 Ductus:
Testing file /home/bourgesl/libs/openjdk/mapbench/dc_boulder_2013-13-30-06-13-17.ser
1 threads and 20 loops per thread, time: 1894 ms
2 threads and 20 loops per thread, time: 3905 ms
4 threads and 20 loops per thread, time: 7485 ms

OpenJDK 8 PATCH ThreadLocal mode:
Testing file /home/bourgesl/libs/openjdk/mapbench/test/dc_shp_alllayers_2013-00-30-07-00-47.ser
1 threads and 20 loops per thread, time: 24211 ms
2 threads and 20 loops per thread, time: 30955 ms
*4 threads and 20 loops per thread, time: 67715 ms*

OpenJDK 8 PATCH ConcurrentLinkedQueue mode:
Testing file /home/bourgesl/libs/openjdk/mapbench/test/dc_shp_alllayers_2013-00-30-07-00-47.ser
1 threads and 20 loops per thread, time: 25984 ms
2 threads and 20 loops per thread, time: 33131 ms
*4 threads and 20 loops per thread, time: 75343 ms*

Oracle JDK8 Ductus:
Loading drawing commands from file: /home/bourgesl/libs/openjdk/mapbench/dc_shp_alllayers_2013-00-30-07-00-47.ser
Loaded DrawingCommands: DrawingCommands{width=1400, height=800, commands=135213}
1 threads and 20 loops per thread, time: 20911 ms
2 threads and 20 loops per thread, time: 39297 ms
4 threads and 20 loops per thread, time: 103392 ms

ConcurrentLinkedQueue adds a small overhead compared to ThreadLocal, but not too much.

Is it possible to test efficiently whether the current thread is the EDT, so that I could use ThreadLocal for the EDT at least?
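
A minimal sketch of such a check, assuming java.awt.EventQueue.isDispatchThread() is cheap enough to call once per rendering operation (this is not measured here). ContextSelector, returnThreadContext() and the RendererContext stand-in are hypothetical names used for illustration only, not the code of the patch: the EDT keeps a ThreadLocal context, and every other thread borrows one from a ConcurrentLinkedQueue.

```java
import java.awt.EventQueue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: names and array sizes are assumptions, not the patch.
final class ContextSelector {

    /** Stand-in for the patch's RendererContext (vars / cache). */
    static final class RendererContext {
        final int[] rowAARLE = new int[8192]; // arbitrary initial size
    }

    // Context dedicated to the event dispatch thread (no pooling needed).
    private static final ThreadLocal<RendererContext> EDT_CONTEXT =
            ThreadLocal.withInitial(RendererContext::new);

    // Shared pool for all other threads (e.g. web container workers).
    private static final ConcurrentLinkedQueue<RendererContext> POOL =
            new ConcurrentLinkedQueue<>();

    /** Called once per rendering operation. */
    static RendererContext getThreadContext() {
        if (EventQueue.isDispatchThread()) {
            return EDT_CONTEXT.get();
        }
        final RendererContext ctx = POOL.poll();
        return (ctx != null) ? ctx : new RendererContext();
    }

    /** Must be called when the rendering operation is done. */
    static void returnThreadContext(final RendererContext ctx) {
        if (!EventQueue.isDispatchThread()) {
            POOL.offer(ctx); // the EDT context simply stays in its ThreadLocal
        }
    }
}
```

With this layout only the EDT ever owns a ThreadLocal value, so worker threads of a web container would not keep a context alive after their request completes.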
Such a test must be very fast, because getThreadContext() is called once per rendering operation, so it is a performance bottleneck.

For example:
Testing file /home/bourgesl/libs/openjdk/mapbench/test/dc_shp_alllayers_2013-00-30-07-00-47.ser
TL:  4 threads and 20 loops per thread, time: 67715 ms
CLQ: 4 threads and 20 loops per thread, time: 75343 ms

Changes:
- use a ThreadLocal or a ConcurrentLinkedQueue<RendererContext> to get a renderer context (vars / cache)
- use the RendererContext members (dirty / clean arrays) first, instead of IntArrayCache / FloatArrayCache (which are dedicated to large dynamic arrays), for performance reasons

TBD:
- recycle pisces class instances, i.e. keep only one instance per class (Renderer, Stroker ...), to avoid the GC overhead entirely (several thousands of instances per MapBench test).

Moreover, these are very small, short-lived objects, so they should stay in the thread-local allocation buffer (TLAB); but when I use -verbose:gc or jmap -histo, they are present and represent megabytes:

[bourgesl@jmmc-laurent ~]$ jmap -histo:live 21628 | grep pisces
   5:         50553        6470784  sun.java2d.pisces.Renderer
   9:         29820        3578400  sun.java2d.pisces.Stroker
  11:         49795        3186880  sun.java2d.pisces.PiscesCache
  12:         49794        1991760  sun.java2d.pisces.PiscesTileGenerator
  13:         49793        1991720  sun.java2d.pisces.Renderer$ScanlineIterator
  14:         29820        1431360  sun.java2d.pisces.PiscesRenderingEngine$NormalizingPathIterator
  52:            40           1280  sun.java2d.pisces.IntArrayCache
  94:            20            640  sun.java2d.pisces.FloatArrayCache
 121:             8            320  [Lsun.java2d.pisces.IntArrayCache;
 127:             4            320  sun.java2d.pisces.RendererContext
 134:             4            256  sun.java2d.pisces.Curve
 154:             4            160  [Lsun.java2d.pisces.FloatArrayCache;
 155:             4            160  sun.java2d.pisces.RendererContext$RendererData
 156:             4            160  sun.java2d.pisces.RendererContext$StrokerData
 157:             4            160  sun.java2d.pisces.Stroker$PolyStack
 208:             3             72  sun.java2d.pisces.PiscesRenderingEngine$NormMode
 256:             1             32  [Lsun.java2d.pisces.PiscesRenderingEngine$NormMode;
 375:             1             16  sun.java2d.pisces.PiscesRenderingEngine
 376:             1             16  sun.java2d.pisces.RendererContext$1

Regards,
Laurent

2013/4/3 Laurent Bourgès <bourges.laur...@gmail.com>

> Thanks for your valuable feedback!
>
>>> Here is the current status of my patch alpha version:
>>> http://jmmc.fr/~bourgesl/share/java2d-pisces/
>>>
>>> There is still a lot to be done: clean-up, stats, pisces class instance
>>> recycling (renderer, stroker ...) and of course sizing correctly initial
>>> arrays (dirty or clean) in the RendererContext (thread local storage).
>>> For performance reasons, I am using now RendererContext members first
>>> (cache for rowAARLE for example) before using ArrayCaches (dynamic arrays).
>>>
>>
>> Thank you Laurent, those are some nice speedups.
>>
> I think it can still be improved: I hope to make it as fast as ductus or
> maybe faster (I have several ideas for aggressive optimizations) but the main
> improvement consists in reusing memory (like C / C++ does) to avoid wasted
> memory / GC overhead in a concurrent environment.
>
>
>> About the thread local storage, that is a sensible choice for highly
>> concurrent systems, at the same time, web containers normally complain about
>> orphaned thread locals created by an application and not cleaned up.
>> Not sure if ones created at the core libs level get special treatment,
>> but in general, I guess it would be nice to have some way to clean them up.
>>
>
> You're right, that's why my patch is not ready!
>
> I chose ThreadLocal for simplicity and clarity but I see several issues:
> 1/ Web container: ThreadLocal must be cleaned up when stopping an
> application to avoid memory leaks (application becomes unloadable due to
> classloader leaks)
> 2/ ThreadLocal access is the fastest way to get the RendererContext as it
> does not require any lock (unsynchronized); as I get the RendererContext
> once per rendering request, I think the ThreadLocal can be replaced by a
> thread-safe ConcurrentLinkedQueue<RendererContext> but it may become a
> performance bottleneck
> 3/ Using a ConcurrentLinkedQueue<RendererContext> requires an efficient /
> proper cache eviction to free memory (Weak or Soft references?) or using
> statistics (last usage timestamp, usage counts)
>
> Any other idea (core-libs) to have an efficient thread context in a web
> container?
>
>> I'm not familiar with the API, but is there any way to clean them up when
>> the graphics2d gets disposed of?
>>
>
> The RenderingEngine is instantiated by the JVM once and I do not see in
> the RenderingEngine interface any way to perform callbacks for warmup /
> cleanup ... nor access to the Graphics RenderingHints (other RFE for tuning
> purposes)
>
>
>> A web application has no guarantee to see the same thread ever again
>> during its life, so thread locals have to be cleaned right away.
>>
>
> I would argue that ThreadLocal can lead to wasted memory as only a few
> concurrent threads can really use their RendererContext instance while
> others can simply answer web requests => let's use a
> ConcurrentLinkedQueue<RendererContext> with a proper cache eviction.
>
>
>>
>> Either that, or see if there is any way to store the array caches in a
>> global structure backed by a concurrent collection to reduce/eliminate
>> contention.
>>
>
> Yes, it is an interesting alternative to benchmark.
>
> Regards,
> Laurent
>
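
Regarding point 3/ of the quoted message, one possible approach to eviction is to let the queue hold SoftReference wrappers, so that the garbage collector may reclaim idle contexts under memory pressure. This is only a rough sketch under that assumption: SoftContextPool, acquire() and release() are hypothetical names, the RendererContext below is a stand-in, and the extra SoftReference allocated on each release would itself need to be benchmarked.

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of a softly-referenced context pool (not the patch).
final class SoftContextPool {

    /** Stand-in for the patch's RendererContext. */
    static final class RendererContext {
        final float[] edges = new float[4096]; // arbitrary initial size
    }

    // The queue only holds soft references: an idle context can be
    // cleared by the GC before an OutOfMemoryError would be thrown.
    private final ConcurrentLinkedQueue<SoftReference<RendererContext>> queue =
            new ConcurrentLinkedQueue<>();

    /** Returns a pooled context if one is still alive, else a new one. */
    RendererContext acquire() {
        SoftReference<RendererContext> ref;
        while ((ref = queue.poll()) != null) {
            final RendererContext ctx = ref.get();
            if (ctx != null) {
                return ctx; // still reachable: reuse it
            }
            // referent was cleared by the GC: drop the stale entry
        }
        return new RendererContext();
    }

    /** Gives the context back once the rendering request is done. */
    void release(final RendererContext ctx) {
        queue.offer(new SoftReference<>(ctx));
    }
}
```

A caller would wrap each rendering request in acquire() and release() (ideally in a try / finally block); contexts that stay idle become eligible for collection without any explicit timestamp or usage-count bookkeeping.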