Hi Laurent,

Thanks a lot for taking the time to test the deferred XRender pipeline, especially since the proprietary nvidia driver is the only accelerated XRender implementation I didn't test / benchmark against.
> On my linux laptop (i7 + nvidia quadro), xrender is already faster than the
> opengl backend (jdk11) on my MapBench tests.
> Finally, J2DBench results do not show any gain on tested cases.
> SHM is slightly better on nvidia (driver supposed to disable it ?) or
> XRBackend / XCB is more efficient with SHM handling.

This is really interesting - it seems the proprietary nvidia driver is currently the only driver handling the current xrender operations well. Back in 2009 I wrote a standalone C benchmark (JXRenderMark) to stress the types of operations performed by the xrender pipeline, and I know the nvidia people had a look at it - great to see it actually turned out to be useful after all.
I could live with no performance win on nvidia, but I definitely would like to avoid regressions. It seems I'll have to get access to a machine equipped with an nvidia GPU and test MapBench there.

> Yesterday I looked at the OpenGL backend code and your new XRDeferedBackend
> looks very closed to OGLRenderQueue (extends RenderQueue) so you may share
> some code about the buffer queue ?
> Moreover, OpenGL backend has a queue flusher although XRDeferedBackend has
> not !

Exactly - the RenderQueue based pipelines buffer their own protocol, which they "replay" later from a single thread, whereas the deferred xrender pipeline directly generates X11 protocol and therefore avoids one level of indirection. So despite the similarities, the actual implementations differ quite a bit.

> Does it mean that few buffered commands may be pending ... until the buffer
> queue or texture is flushed ?

The deferred xrender pipeline behaves no differently than the x11 or the "old" xrender pipeline in this regard. The self-generated protocol is flushed whenever someone calls into a native Xlib function, via the callback returnSocketCB().

> Here is my first conclusion:
> - nvidia GPU (or drivers) are so fast & optimized that the XRender API
> overhead is already very small in contrary to intel / AMD CPU that have
> either slower GPU or less efficient drivers.
> - anybody could test on other discrete GPU or recent CPU ?

In this case the overhead is caused by the drivers; GPU utilization is typically minor for most/all of those workloads.

> Why not larger texture than 256x256 ?
> Is it uploaded completely in GPU (compromise) ? or partially ?

Only the area occupied by mask data is uploaded. 256x256 is configurable (at least in code) and was a compromise between SHM areas in-flight and memory use.

> Is alignment important (16 ?) in GPU ? ie padding in x / y axis may improve
> performance ?
> Idem for row interleaving ? is it important ?
> Why not pack the tile as an 1D contiguous array ?

For ShmPutImage it doesn't matter; for XPutImage this is exactly what the code in PutImage does.

> I am a bit lost in how tiles are packed into the SHM_BUFFERS ... and why the
> normal XPutImage is more complicated than in XRBackendNative.

This is an optimization: since we have to copy the data to the socket anyway, we can use this copy to compensate for the different scanline stride between the mask buffer and the width of the uploaded area, so the data is copied to the socket line-by-line (a small sketch below illustrates this).

> - how to pack more efficiently the tiles into larger textures (padding) in x
> or XY directions ? use multiple textures (pyramid) ?

This is an area that could use improvement. For now, tiles are laid out in a row one after another until the remaining buffer-width < tile-width, then the next row is started (see the second sketch below).
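To illustrate the line-by-line copy mentioned above, here is a minimal sketch - not the actual pipeline code, the class/method/parameter names are made up - of copying one tile out of a mask buffer whose scanline stride is wider than the uploaded area:

    import java.io.IOException;
    import java.io.OutputStream;

    /*
     * Hypothetical sketch (not the real pipeline code): copy a w x h mask
     * tile out of a larger mask buffer whose scanline stride (maskScan)
     * differs from the uploaded width, writing tightly packed rows to the
     * socket.
     */
    class MaskCopySketch {
        static void copyTileToSocket(byte[] maskBuffer, int maskScan,
                                     int x, int y, int w, int h,
                                     OutputStream socket) throws IOException {
            for (int row = 0; row < h; row++) {
                int srcOffset = (y + row) * maskScan + x;
                // only w bytes per row are written, the stride padding is skipped
                socket.write(maskBuffer, srcOffset, w);
            }
        }
    }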
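And a rough sketch of the row-based packing just described - again only an illustration with made-up names, the real layout logic lives in the deferred backend:

    /*
     * Hypothetical sketch of the row-based tile packing described above:
     * tiles are placed left to right until the remaining buffer width is
     * smaller than the tile width, then a new row is started.
     */
    class TilePackingSketch {
        static final int BUFFER_WIDTH = 256;  // the 256x256 buffer mentioned above

        int curX = 0, curY = 0, rowHeight = 0;

        java.awt.Point allocateTile(int tileW, int tileH) {
            if (BUFFER_WIDTH - curX < tileW) {
                // current row is full -> continue in the next row
                curX = 0;
                curY += rowHeight;
                rowHeight = 0;
            }
            java.awt.Point pos = new java.awt.Point(curX, curY);
            curX += tileW;
            rowHeight = Math.max(rowHeight, tileH);
            return pos;
        }
    }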
> PS: I can share (later) my variant of your patch (as I slightly modified it)
> to fix typos, debugs ...

That would be great.

Thanks again & best regards,
Clemens