Hi,

After achieving huge speedups with Marlin, Laurent Bourgès recently proposed increasing the AA tile size of the MaskBlit/MaskFill operations. The 128x64 tile size should help the XRender pipeline a lot for larger AA shapes; for smaller ones, however, XRender stays rather slow.
To solve this issue I am currently working on batching the AA tile mask uploads in the XRender pipeline, to improve performance with antialiasing enabled. Batching can happen regardless of state changes, so different shapes with different properties can all be uploaded in one batch. Furthermore, that batching (resulting in larger uploads) allows the mask upload to use XShm (shared memory), reducing the number of data copies and context switches.

Initial results seem very promising - beating the current OpenGL implementation by a wide margin. J2DBench, 20x20 ellipse antialiased:

XRender + deferred mask upload + XSHM:
> Test(graphics.render.tests.fillOval) averaged
> 3.436728470039390E7 pixels/sec
> with width1, !clip, Default, !alphacolor, ident,
> !extraalpha, single, !xormode, antialias, SrcOver, 20x20, bounce, to
> VolatileImg(Opaque)

XRender + deferred mask upload:
> Test(graphics.render.tests.fillOval) averaged
> 3.0930638830897704E7 pixels/sec
> with width1, !clip, Default, !alphacolor, ident,
> !extraalpha, single, !xormode, antialias, SrcOver, 20x20, bounce, to
> VolatileImg(Opaque)

OpenGL pipeline:
> Test(graphics.render.tests.fillOval) averaged
> 1.3258861545909312E7 pixels/sec
> with Default, !xormode, !extraalpha, single, bounce,
> 20x20, to VolatileImg(Opaque), ident, !clip, !alphacolor, antialias,
> SrcOver, width1

XRender as-is:
> Test(graphics.render.tests.fillOval) averaged
> 6031195.796094009 pixels/sec
> with !alphacolor, bounce, !extraalpha, !xormode,
> antialias, Default, single, ident, SrcOver, 20x20, to
> VolatileImg(Opaque), !clip, width1

And a real-world test - MigLayout Swing Benchmark with Nimbus LnF, ms for one iteration:

XRender-Deferred + SHM:
  AMD: 850 ms
  Intel: 1300 ms
OpenGL:
  AMD: 1260 ms
  Intel: 2580 ms
XRender (as is):
  AMD: 2620 ms
  Intel: 4690 ms

(AMD: AMD Kaveri 7650k / Intel: Intel Core i5 640M)

It is still in prototype state with a few rough edges and a few corner cases unimplemented (e.g. extra alpha with antialiasing), but it should be able to run most workloads:

http://93.83.133.214/webrev/
https://sourceforge.net/p/xrender-deferred/code/ref/default/

It is disabled by default and can be enabled with -Dsun.java2d.xr.deferred=true.
Shm upload is enabled by default when deferred is active, and can be disabled with -Dsun.java2d.xr.shm=false.

What would be the best way forward? Would this have a chance to get into OpenJDK 11 for platforms with XCB-based Xlib implementations? Keeping in mind the dramatic performance increase, even outperforming the current OpenGL pipeline, I really hope so.

Another change I would hope to see is a modification of the MaskBlit/MaskFill interfaces. For now Marlin has to rasterize into a byte[] tile; this array is afterwards passed to the pipeline, and the pipeline itself has to copy it again into some internal buffer. With the enhancements described above, I already see this copy process consuming ~5-10% of CPU cycles. Instead, the pipeline could provide Marlin a ByteBuffer to rasterize into, along with information regarding stride/width/etc.
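To sketch what I have in mind (the names below are made up for illustration; they are not the existing Java2D interfaces):

    import java.nio.ByteBuffer;

    /*
     * Hypothetical ByteBuffer-based variant of the MaskFill handoff:
     * the pipeline exposes a buffer and its layout, the rasterizer
     * writes coverage values straight into it, and the extra copy
     * into an internal buffer disappears.
     */
    public interface BufferedMaskFill {

        /* Buffer the rasterizer writes the coverage mask into. */
        ByteBuffer getMaskBuffer();

        /* Distance in bytes between two consecutive mask rows. */
        int getMaskScan();

        /* Maximum tile dimensions the buffer can hold. */
        int getMaskWidth();
        int getMaskHeight();

        /*
         * Called by the rasterizer once the tile covering (x, y, w, h)
         * has been written; the pipeline blends directly from the buffer.
         */
        void maskFill(int x, int y, int w, int h);
    }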
Best regards,
Clemens

Some background regarding the issue / implementation:

Since the creation of the XRender Java2D backend, I was always bothered by how poorly it performed with antialiasing enabled. What the XRender backend does in this situation seems not to be that common - modern drivers basically stall the GPU for every single AA tile (currently 32x32). Pisces was so slow that xservers could consume the tiles more or less at the speed Pisces provided them.

However, with the excellent work on Pisces's successor Marlin (big thanks to Laurent Bourgès), the bottleneck the XRender pipeline presented became more and more evident.

One early approach to solving this weakness was to implement the AA primitives using a modified version of Cairo, sending a list of trapezoids to the X server instead of the AA coverage masks. However, this approach has its own performance issues (and is considered hard to GPU-accelerate), and finally, because of the maintenance burden, the idea was dropped.

The root of all evil is the immediate nature of Java2D: Java2D calls into the backends with 32x32 tiles and expects them to "immediately" perform a blending operation with the 32x32px alpha mask provided. In the XRender pipeline, this results in an XPutImage call for uploading the coverage mask, immediately followed by an XRenderComposite call performing the blending. This means:

- a lot of traffic on the X11 protocol socket for transferring the mask data -> context switches
- a lot of GPU stalls, because the data uploaded from system memory is immediately used as input for the GPU operation
- a lot of driver/GPU state invalidation, because various different operations are mixed

What would help in this situation would be to combine all those small RAM->VRAM uploads into one larger upload, followed by a series of blending operations. So instead of:

    while (moreTiles) {
        XPutImage(32x32);
        XRenderComposite(32x32);
    }

we issue:

    XPutImage(256x256);
    while (moreTiles) {
        XRenderComposite(32x32);
    }

Long story short: using XCB's socket handoff functionality this can be done:
https://lists.debian.org/debian-x/2008/10/msg00209.html

Socket handoff gives the user control over when to submit protocol to the X server (so the XRenderComposite commands can be queued without actually being executed), while the AA tiles are buffered in a larger mask - and before the XRenderComposite commands are sent to the X server, we simply prepend the single, large XPutImage operation.

The tradeoff is that while the socket is taken, the application has to generate all the X11 protocol by itself - which means quite a bit of new code. Every X function not implemented on our own will cause the socket to be revoked, which incurs overhead and limits the timeframe in which batching can be applied. The good news is we don't have to handle every corner case - for uncommon requests we simply fall back to the previous implementation: Xlib would grab the socket back and the request would be generated in native code.

The implementation is careful not to introduce additional overhead (except for a single additional if + method call per primitive) in cases where no antialiasing is used. In case no MaskFill/Blit operations are enqueued, the old code paths are used exclusively, without any change in operations.

Shm is done with 4 independent regions inside a single XShmImage. After a region has been queued for upload using XShmPutImage, a GetInputFocus request is queued - when the reply comes in, the pipeline knows the region can be re-used again. In case all regions are in flight, the pipeline will gracefully degrade to a normal XPutImage, which has the nice properties of not introducing any sync overhead and of flushing the command stream so the pending ShmPutImage operations get processed.
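To make the region recycling a bit more concrete, here is a minimal sketch of the bookkeeping; the names are hypothetical (they do not appear in the prototype) and the actual protocol generation is only hinted at in comments:

    /*
     * Minimal sketch of the 4-region rotation inside one XShmImage.
     * The "fence" is the GetInputFocus reply described above.
     */
    final class ShmRegionPool {
        private static final int REGIONS = 4;
        private final boolean[] inFlight = new boolean[REGIONS];

        /* Returns a free region index, or -1 if all are in flight
         * (the caller then degrades to a plain XPutImage). */
        int acquireRegion() {
            for (int i = 0; i < REGIONS; i++) {
                if (!inFlight[i]) {
                    return i;
                }
            }
            return -1;
        }

        /* The region was handed to XShmPutImage: mark it busy; a
         * GetInputFocus request is queued right behind the upload. */
        void uploadQueued(int region) {
            inFlight[region] = true;
        }

        /* The fence reply for this region arrived: the server is done
         * reading, so the region may be overwritten again. */
        void fenceArrived(int region) {
            inFlight[region] = false;
        }
    }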
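Putting the pieces together, the deferred scheme conceptually boils down to something like the following sketch (again all names are hypothetical, and the hand-written X11 protocol generation is stubbed out):

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    /*
     * Conceptual sketch of the deferred mask batching: tiles are
     * accumulated in one large mask image while the matching composite
     * operations are only recorded; flush() first writes the single
     * large upload, then all recorded composites.
     */
    final class DeferredMaskQueue {
        private final ByteBuffer mask = ByteBuffer.allocateDirect(256 * 256);
        private final List<int[]> composites = new ArrayList<>();

        /* Record one composite (tile position inside the large mask
         * plus destination rectangle) instead of executing it now. */
        void addTile(int maskX, int maskY, int dstX, int dstY, int w, int h) {
            composites.add(new int[] { maskX, maskY, dstX, dstY, w, h });
        }

        /* Called when the mask is full or the batch must be submitted. */
        void flush() {
            putImage(mask);              // one large upload (or ShmPutImage)
            for (int[] op : composites) {
                renderComposite(op);     // cheap GPU-side blends
            }
            composites.clear();
            mask.clear();
        }

        /* Stubs standing in for the hand-written protocol code. */
        private void putImage(ByteBuffer m) { /* ... */ }
        private void renderComposite(int[] op) { /* ... */ }
    }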