Hi Clemens,

Sorry, this is a long email giving my feedback on your XRender efforts.
> After achieving huge speedups with Marlin, Laurent Bourgès recently
> proposed increasing the AA tile size of MaskBlit/MaskFill operations.
> The 128x64 tile size should help the XRender pipeline a lot for
> larger aa shapes. For smaller ones xrender stays rather slow.

Thanks. On my linux laptop (i7 + nvidia quadro), xrender is already faster
than the opengl backend (jdk11) on my MapBench tests.

> To solve this issue I am currently working on batching the AA tile mask
> uploads in the xrender pipeline to improve the performance with
> antialiasing enabled.
> Batching can happen regardless of state changes, so different shapes
> with different properties can all be uploaded in one batch.
> Furthermore that batching (resulting in larger uploads) allows for
> mask upload using XShm (shared memory), reducing the number of data
> copies and context switches.

First impressions: I looked at your code and I mostly understand it, except
how the tiles are packed into the larger texture (an illustration is
missing, please) and the fence handling.

Yesterday I looked at the OpenGL backend code: your new XRDeferedBackend
looks very close to OGLRenderQueue (extends RenderQueue), so maybe you
could share some code for the buffer queue? Moreover, the OpenGL backend
has a queue flusher whereas XRDeferedBackend has none! Does that mean a few
buffered commands may stay pending until the buffer queue or the texture is
flushed?

> It is still in prototype state with a few rough edges and a few
> corner-cases unimplemented (e.g. extra alpha with antialiasing),
> but should be able to run most workloads:
> http://93.83.133.214/webrev/
> https://sourceforge.net/p/xrender-deferred/code/ref/default/

I will give you more details about your code later (pseudo-review), but I
noticed that XRBackendNative uses putMaskNative (C), which seems more
efficient than XRDeferedBackend (mask copy in Java + XPutImage in C)...
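To make sure I understand the batching idea, here is how I picture it (a
minimal Java sketch; all names such as DeferredMaskQueue are hypothetical,
this is not your actual XRDeferedBackend code): tiles are copied into one
large 256x256 mask buffer, and the per-tile composites are only replayed
after a single upload of that buffer.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of deferred AA-tile batching, assuming a 256x256 mask buffer
 * split into fixed 32x32 cells (64 tiles per upload). Instead of one
 * XPutImage + XRenderComposite per tile, tiles accumulate in the buffer,
 * then one upload is issued followed by the queued composites.
 */
final class DeferredMaskQueue {
    static final int BUF = 256, TILE = 32, CELLS = BUF / TILE; // 8x8 cells

    final byte[] maskBuffer = new byte[BUF * BUF];
    final List<int[]> pending = new ArrayList<>(); // {maskX, maskY, dstX, dstY}
    int used;       // cells currently occupied
    int uploads;    // how many "XPutImage(256x256)" calls were issued
    int composites; // how many "XRenderComposite(32x32)" calls were issued

    void addTile(byte[] tile, int dstX, int dstY) {
        if (used == CELLS * CELLS) {
            flush(); // buffer full: upload once, then replay composites
        }
        int cx = used % CELLS, cy = used / CELLS;
        // copy the 32x32 tile into its cell (destination stride = BUF)
        for (int row = 0; row < TILE; row++) {
            System.arraycopy(tile, row * TILE,
                    maskBuffer, (cy * TILE + row) * BUF + cx * TILE, TILE);
        }
        pending.add(new int[] { cx * TILE, cy * TILE, dstX, dstY });
        used++;
    }

    void flush() {
        if (used == 0) return;
        uploads++; // would be a single XPutImage / XShmPutImage here
        composites += pending.size(); // would be one XRenderComposite each
        pending.clear();
        used = 0;
    }
}
```

With 64 tiles per upload, a map with tens of thousands of small tiles would
need roughly 1/64th of the uploads, which would explain the gains on the
complex maps.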
> It is disabled by default, and can be enabled with
> -Dsun.java2d.xr.deferred=true
> Shm upload is enabled with deferred and can be disabled with:
> -Dsun.java2d.xr.shm=false

I merged your patch onto the latest jdk11 + the pending Marlin 0.9.1 patch
and it works well (except extra-alpha is missing).

> What would be the best way forward?
> Would this have a chance to get into OpenJDK11 for platforms with
> XCB-based Xlib implementations?
> Keeping in mind the dramatic performance increase,
> even outperforming the current OpenGL pipeline, I really hope so.

I did performance testing on nvidia hardware (binary driver 390.12):
Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
Nvidia Quadro M1000M

1/ J2DBench results (AA on all shapes with size = 20 & 250):

options:
https://github.com/bourgesl/bourgesl.github.io/tree/master/j2dbench/options/default_2018.opt

Deferred off vs on: ~1 to 15% slower
http://bourgesl.github.io/j2dbench/xr_results/Summary_Report.html

Deferred enabled, SHM off vs on: ~3 to 10% faster
http://bourgesl.github.io/j2dbench/xr_results_shm/Summary_Report.html

See raw data:
https://github.com/bourgesl/bourgesl.github.io/tree/master/j2dbench/

Finally, the J2DBench results do not show any gain on the tested cases.
SHM is slightly better on nvidia (isn't the driver supposed to disable
it?), or the XRBackend / XCB is more efficient at SHM handling.

Perspectives:
- test smaller shapes (size=1 with width=5) to increase the tile packing
  factor?
- how to pack the tiles more efficiently into larger textures (padding) in
  the X or XY directions? use multiple textures (pyramid)?
- optimize the tile copies anyway, or the queue flushing?
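On the packing question: since Marlin tiles can now be up to 128x64 and
small shapes produce smaller tiles, a simple shelf packer could reduce the
padding waste. A sketch of what I mean (hypothetical ShelfPacker class, not
code from the patch):

```java
/**
 * Hypothetical shelf packer for variable-size AA tiles (up to 128x64 with
 * Marlin) inside a 256x256 mask texture. Tiles are placed left-to-right on
 * the current shelf; a new shelf opens below when the row is full. Padding
 * is wasted at the right edge of each shelf and below short tiles, which
 * is exactly the packing-factor question above.
 */
final class ShelfPacker {
    static final int W = 256, H = 256;
    private int shelfX, shelfY, shelfH; // current shelf cursor

    /** Returns {x, y} for the tile, or null if full (a flush is needed). */
    int[] place(int tw, int th) {
        if (shelfX + tw > W) { // row full: open a new shelf below
            shelfY += shelfH;
            shelfX = 0;
            shelfH = 0;
        }
        if (shelfY + th > H) {
            return null; // texture full
        }
        int[] pos = { shelfX, shelfY };
        shelfX += tw;
        shelfH = Math.max(shelfH, th);
        return pos;
    }
}
```

For uniform 32x32 tiles this degenerates to the fixed 8x8 grid (64 tiles),
but mixed tile sizes pack more tightly than fixed 128x64 cells would.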
2/ MapBench tests with -Dsun.java2d.xr.deferred=false/true:

I found 2 cases with large gains (20% to 40% faster), whereas the other
maps show ~10% losses:

- dc_shp_alllayers_2013-00-30-07-00-47.ser {width=1400, height=800, commands=135213}

       Threads  Ops  Med      Pct95    Avg      StdDev  Min      Max      FPS(med)  [ms/op]
  off: 1        14   727.411  728.847  727.394   1.127  725.197  729.833   1.375
  on:  1        23   443.919  486.207  456.228  19.807  438.598  486.902   2.253

- test_z_625k.ser {width=1272, height=1261, commands=23345}

       Threads  Ops  Med      Pct95    Avg      StdDev  Min      Max      FPS(med)  [ms/op]
  off: 1        96   108.856  109.923  108.915   0.588  107.886  111.762   9.186
  on:  1        113   90.908   92.837   91.021   1.067   89.029   96.558  11.000

These two cases are the most complex maps (many small shapes), so the tile
packing is a big win (high tile count per texture upload and fewer
uploads).

Here is my first conclusion:
- nvidia GPUs (or their drivers) are so fast and optimized that the XRender
  API overhead is already very small, contrary to Intel / AMD systems that
  have either slower GPUs or less efficient drivers.
- could anybody test on another discrete GPU or a recent CPU?

Anyway, I still think it is worth continuing to improve this patch... any
idea is welcome.

Clemens, you could have a look at the OpenJFX code, as I remember its
OpenGL backend is more efficient (buffering + texture uploads), so we could
get some ideas for improvements.

> Another change I would hope to see is a modification of the
> maskblit/maskfill interfaces.
> For now marlin has to rasterize into a byte[] tile, this array is
> afterwards passed to the pipeline,
> and the pipeline itself has to copy it again into some internal buffer.
> With the enhancements described above, I see this copy process already
> consuming ~5-10% of cpu cycles.
> Instead the pipeline could provide a ByteBuffer to rasterize into to
> Marlin, along with information regarding stride/width/etc.
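Just to picture your ByteBuffer proposal (purely illustrative names, not
the real Java2D MaskFill/MaskBlit interfaces): the pipeline would expose a
destination buffer plus offset/stride, so the rasterizer writes coverage
bytes straight into the upload buffer and the intermediate byte[] copy
disappears.

```java
import java.nio.ByteBuffer;

/** Hypothetical zero-copy tile interface provided by the pipeline. */
interface MaskTileSink {
    /** Reserve a w x h region; write coverage at offset + y*stride + x. */
    ByteBuffer reserveTile(int w, int h);
    int tileOffset();
    int tileStride();
    /** Called once the tile is fully rasterized. */
    void tileDone(int dstX, int dstY, int w, int h);
}

/** Toy sink backed by one 256x256 direct buffer (as in the prototype). */
final class BufferSink implements MaskTileSink {
    static final int STRIDE = 256;
    final ByteBuffer buf = ByteBuffer.allocateDirect(STRIDE * 256);
    private int nextOffset, offset;

    public ByteBuffer reserveTile(int w, int h) {
        offset = nextOffset;
        nextOffset += h * STRIDE; // naive: full-width rows per tile
        return buf;
    }
    public int tileOffset() { return offset; }
    public int tileStride() { return STRIDE; }
    public void tileDone(int dstX, int dstY, int w, int h) {
        // would queue the XRenderComposite for this region
    }
}

final class ZeroCopyDemo {
    /** The rasterizer writes directly, no intermediate byte[] tile. */
    static void fillSolidTile(MaskTileSink sink, int w, int h, byte cov) {
        ByteBuffer b = sink.reserveTile(w, h);
        int off = sink.tileOffset(), stride = sink.tileStride();
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                b.put(off + y * stride + x, cov);
        sink.tileDone(0, 0, w, h);
    }
}
```

If the sink's buffer is the SHM upload buffer itself, the ~5-10% copy cost
you measured would go away, but as said below I must check what this does
to the other backends.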
That sounds like a good idea, but I must study the impact on the other
backends...

> Some background regarding the issue / implementation:
>
> What would help in this situation would be to combine all those small
> RAM->VRAM uploads into a larger one,
> followed by a series of blending operations.
> So instead of:
>   while(moreTiles) { XPutImage(32x32); XRenderComposite(32x32); }
> ->
>   XPutImage(256x256); while(moreTiles) { XRenderComposite(32x32); }

Why not a larger texture than 256x256? Is it uploaded completely to the GPU
(a compromise) or partially? Is alignment (16?) important on the GPU, i.e.
could padding along the X / Y axis improve performance? Same question for
row interleaving: is it important? Why not pack the tiles as a 1D
contiguous array?

> Shm is done with 4 independent regions inside a single XShmImage.
> After a region has been queued for upload using XShmPutImage, a
> GetInputFocus request is queued - when the reply comes in, the
> pipeline knows the region can be re-used again.
> In case all regions are in-flight, the pipeline will gracefully
> degrade to a normal XPutImage, which has the nice properties of not
> introducing any sync overhead and cleaning the command-stream to get
> the pending ShmPutImage operations processed.

I am a bit lost on how the tiles are packed into the SHM_BUFFERS... and on
why the normal XPutImage path is more complicated than in XRBackendNative.

PS: I can share my variant of your patch later (I slightly modified it to
fix typos, debug output, etc.).

Cheers,
Laurent