Hi Clemens,

As I am enjoying my winter holidays, I will try your patch once I am back home.
It seems very promising, and I will try to understand the changes to the C code. I will also test on my Linux machines with NVIDIA cards (Quadro 610 & 1070).

Cheers,
Laurent

On 22 Feb 2018 8:42 AM, "Clemens Eisserer" <linuxhi...@gmail.com> wrote:

> Hi,
>
> After achieving huge speedups with Marlin, Laurent Bourgès recently
> proposed increasing the AA tile size of MaskBlit/MaskFill operations.
> The 128x64 tile size should help the XRender pipeline a lot for
> larger AA shapes. For smaller shapes, XRender stays rather slow.
>
> To solve this issue, I am currently working on batching the AA tile
> mask uploads in the XRender pipeline to improve performance with
> antialiasing enabled.
> Batching can happen regardless of state changes, so different shapes
> with different properties can all be uploaded in one batch.
> Furthermore, batching (resulting in larger uploads) allows for
> mask upload using XShm (shared memory), reducing the number of data
> copies and context switches.
>
> Initial results seem very promising - beating the current OpenGL
> implementation by a wide margin:
>
> J2DBench, 20x20 ellipse antialiased:
>
> XRender + deferred mask upload + XSHM:
> > Test(graphics.render.tests.fillOval) averaged
> > 3.436728470039390E7 pixels/sec
> > with width1, !clip, Default, !alphacolor, ident,
> > !extraalpha, single, !xormode, antialias, SrcOver, 20x20, bounce, to
> > VolatileImg(Opaque)
>
> XRender + deferred mask upload:
> > Test(graphics.render.tests.fillOval) averaged
> > 3.0930638830897704E7 pixels/sec
> > with width1, !clip, Default, !alphacolor, ident,
> > !extraalpha, single, !xormode, antialias, SrcOver, 20x20, bounce, to
> > VolatileImg(Opaque)
>
> OpenGL pipeline:
> > Test(graphics.render.tests.fillOval) averaged
> > 1.3258861545909312E7 pixels/sec
> > with Default, !xormode, !extraalpha, single, bounce,
> > 20x20, to VolatileImg(Opaque), ident, !clip, !alphacolor, antialias,
> > SrcOver, width1
>
> XRender as-is:
> > Test(graphics.render.tests.fillOval) averaged
> > 6031195.796094009 pixels/sec
> > with !alphacolor, bounce, !extraalpha, !xormode,
> > antialias, Default, single, ident, SrcOver, 20x20, to
> > VolatileImg(Opaque), !clip, width1
>
> And a real-world test: MigLayout Swing Benchmark with NimbusLnf, ms
> for one iteration:
>
> XRender-Deferred + SHM:
> AMD: 850 ms
> Intel: 1300 ms
>
> OpenGL:
> AMD: 1260 ms
> Intel: 2580 ms
>
> XRender (as is):
> AMD: 2620 ms
> Intel: 4690 ms
>
> (AMD: AMD Kaveri 7650k / Intel: Intel Core i5 640M)
>
> It is still in a prototype state with a few rough edges and a few
> corner cases unimplemented (e.g. extra alpha with antialiasing),
> but it should be able to run most workloads:
> http://93.83.133.214/webrev/
> https://sourceforge.net/p/xrender-deferred/code/ref/default/
>
> It is disabled by default and can be enabled with
> -Dsun.java2d.xr.deferred=true
> SHM upload is enabled together with deferred mode and can be
> disabled with:
> -Dsun.java2d.xr.shm=false
>
> What would be the best way forward?
> Would this have a chance to get into OpenJDK11 for platforms with
> XCB-based Xlib implementations?
> Keeping in mind the dramatic performance increase,
> even outperforming the current OpenGL pipeline, I really hope so.
>
> Another change I would hope to see is a modification of the
> maskblit/maskfill interfaces.
> For now, Marlin has to rasterize into a byte[] tile. This array is
> afterwards passed to the pipeline, and the pipeline itself has to
> copy it again into some internal buffer.
> With the enhancements described above, I see this copy process already
> consuming ~5-10% of CPU cycles.
> Instead, the pipeline could provide Marlin with a ByteBuffer to
> rasterize into, along with information regarding stride/width/etc.
>
> Best regards, Clemens
>
> Some background regarding the issue / implementation:
>
> Since the creation of the XRender Java2D backend, I have always been
> bothered by how poorly it performed with antialiasing enabled.
> What the XRender backend does in this situation seems to be rather
> uncommon - modern drivers basically stall the GPU for every single
> AA tile (currently 32x32).
>
> Pisces was so slow that X servers could consume the tiles more or
> less at the speed Pisces provided them.
> However, with the excellent work on Pisces's successor Marlin (big
> thanks to Laurent Bourgès), the bottleneck the XRender pipeline
> presented became more and more evident.
>
> One early approach to solve this weakness was to implement the AA
> primitives using a modified version of Cairo,
> sending a list of trapezoids to the X server instead of the AA
> coverage masks.
> However, this approach has its own performance issues (and is
> considered hard to GPU-accelerate), and in the end the idea was
> dropped because of the maintenance burden.
>
> The root of all evil is the immediate nature of Java2D:
> Java2D calls into the backends with 32x32 tiles and expects them to
> "immediately" perform a blending operation with the 32x32px alpha
> mask provided.
> In the XRender pipeline, this results in an XPutImage call for
> uploading the coverage mask, immediately followed by an
> XRenderComposite call performing the blending.
> This means:
> - a lot of traffic on the X11 protocol socket for transferring the
> mask data -> context switches
> - a lot of GPU stalls, because the data uploaded from system memory is
> immediately used as input for the GPU operation
> - a lot of driver/GPU state invalidation, because various different
> operations are mixed
>
> What would help in this situation would be to combine all those small
> RAM->VRAM uploads into a larger one,
> followed by a series of blending operations.
> So instead of:
>   while(moreTiles) { XPutImage(32x32); XRenderComposite(32x32); }
> ->
>   XPutImage(256x256); while(moreTiles) { XRenderComposite(32x32); }
>
> Long story short: this can be done using XCB's socket handoff
> functionality: https://lists.debian.org/debian-x/2008/10/msg00209.html
> Socket handoff gives the user control over when to submit protocol to
> the X server (so the XRenderComposite commands can be queued without
> actually being executed), while the AA tiles are buffered in a larger
> mask - and before the XRenderComposite commands are sent to the
> X server, we simply prepend the single, large XPutImage operation.
>
> The tradeoff is that, while the socket is taken, the application has
> to generate all the X11 protocol by itself - which means quite a bit
> of new code.
> Every X function we have not implemented ourselves will cause the
> socket to be revoked, which incurs overhead and limits the timeframe
> in which batching can be applied.
> The good news is that we don't have to handle every corner case - for
> uncommon requests we simply fall back to the previous implementation:
> Xlib would grab the socket and the request would be generated in
> native code.
>
> The implementation is careful not to introduce additional overhead
> (except for a single additional if + method call per primitive) in
> cases where no antialiasing is used.
> In case no MaskFill/Blit operations are enqueued, the old code paths
> are used exclusively, without any change in operations.
>
> SHM is done with 4 independent regions inside a single XShmImage.
> After a region has been queued for upload using XShmPutImage, a
> GetInputFocus request is queued - when the reply comes in, the
> pipeline knows the region can be re-used again.
> In case all regions are in flight, the pipeline will gracefully
> degrade to a normal XPutImage, which has the nice properties of not
> introducing any sync overhead and flushing the command stream so that
> the pending ShmPutImage operations get processed.
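
To check my understanding of the SHM region handling described in the last paragraph, here is a minimal Java sketch of the bookkeeping only. It is not taken from the patch: the class name ShmRegionRing and its methods are hypothetical, no X11 calls are made, and the fence sequence number is assumed to be a monotonically increasing long (real X11 sequence numbers wrap and would need more care).

```java
import java.util.ArrayDeque;

/**
 * Hypothetical sketch: tracks which upload regions of a shared-memory
 * image are still in flight. Not the actual patch code.
 */
class ShmRegionRing {
    private final boolean[] inFlight;
    // FIFO of {regionIndex, fenceSequence} pairs still waiting for a reply.
    private final ArrayDeque<long[]> pending = new ArrayDeque<>();

    ShmRegionRing(int regionCount) {
        inFlight = new boolean[regionCount];
    }

    /** Returns a free region index, or -1 if every region is in flight. */
    int acquire() {
        for (int i = 0; i < inFlight.length; i++) {
            if (!inFlight[i]) {
                return i;
            }
        }
        return -1; // caller would fall back to a plain (non-SHM) upload
    }

    /**
     * Marks a region as queued for upload. fenceSequence stands for the
     * sequence number of the round-trip request queued right after it.
     */
    void submit(int region, long fenceSequence) {
        inFlight[region] = true;
        pending.addLast(new long[] {region, fenceSequence});
    }

    /** Frees every region whose fence was queued at or before the reply. */
    void onReply(long fenceSequence) {
        while (!pending.isEmpty() && pending.peekFirst()[1] <= fenceSequence) {
            inFlight[(int) pending.pollFirst()[0]] = false;
        }
    }

    public static void main(String[] args) {
        ShmRegionRing ring = new ShmRegionRing(4);
        long seq = 100;
        for (int tileBatch = 0; tileBatch < 6; tileBatch++) {
            int region = ring.acquire();
            if (region < 0) {
                System.out.println("batch " + tileBatch + ": all regions busy, plain upload");
                ring.onReply(seq - 1); // pretend all earlier replies arrived
            } else {
                System.out.println("batch " + tileBatch + ": SHM region " + region);
                ring.submit(region, seq++);
            }
        }
    }
}
```

In this sketch a reply for a later fence also releases all regions fenced earlier, which relies on the server processing requests in order.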