I've been playing around some with the simple problem of:

    How to take 32 bit RGBA data on the client side and
    composite it against a drawable as fast as possible
    using RENDER.

One of the factors that influences the performance of this is
how many copies of the data must be made before we composite.
Some of the factors that influence that:

 * Are we running remote or locally?
 * Does RENDER support the source data format? [1]
 * Where must the source be: video memory or system memory? [2]
 * Does the hardware support the source data format? [1]

We'll ignore the last two in this discussion, since:

 - Currently the source should always be in system memory
   for XFree86 whether using software or hardware compositing (see [2]).
 - If hardware accelerating, we can convert to the hw format
   when passing the source data to the video card. The only case
   where hardware support for the data format would matter 
   (in terms of number of copies) would be if we could set up the 
   video card to composite with DMA directly from system memory.

On the client side, we need a copy:

 - To convert the data, if RENDER doesn't support the source
   data format.

 - To get the data into a Shm buffer if we are running locally,
   or into the PutImage buffer if we are running remotely.

   (This copy can be combined with the conversion to a
   RENDER-supported format in the Shm case. In the remote case, that
   would be possible without changing the X protocol,
   but it's probably easier to simply make sure we support
   the source format.)
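
As a rough sketch of folding the conversion into the copy: the loop
below premultiplies while it copies, so the data is only touched once.
The function names are hypothetical, the R,G,B,A byte order is an
assumption, and dst is a plain buffer standing in for the Shm segment.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply an 8-bit channel by an 8-bit alpha, i.e. c * a / 255
 * rounded to nearest, without a division. */
static uint8_t mul_un8(uint8_t c, uint8_t a)
{
    uint32_t t = (uint32_t)c * a + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Copy n_pixels of non-premultiplied R,G,B,A bytes from src to dst,
 * premultiplying the color channels in the same pass. */
static void copy_premultiply_rgba(uint8_t *dst, const uint8_t *src,
                                  size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; i++) {
        uint8_t a = src[4 * i + 3];

        dst[4 * i + 0] = mul_un8(src[4 * i + 0], a);
        dst[4 * i + 1] = mul_un8(src[4 * i + 1], a);
        dst[4 * i + 2] = mul_un8(src[4 * i + 2], a);
        dst[4 * i + 3] = a;    /* alpha is copied unchanged */
    }
}
```

In real use dst would point into the shared memory segment backing
the ShmPixmap, so no separate conversion buffer is needed.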

Being remote implies:

 - A copy from the source image into the PutImage buffer.
 - Network buffer copies on the source and destination sides.

On the server side, we need a copy if:

 - We aren't using ShmPixmaps, so we have to copy the data into
   a pixmap/picture, then composite the pixmap/picture. (See 
   comment in [2] about "CompositeImage") 

So, the total number of copies if we are running locally is:

 1, whether or not RENDER supports the source data format -- we
 need to make a copy into a ShmPixmap, and can convert as we
 do that copy. 

There's no improvement to be made here in the number of copies, but
support for source formats would still improve performance, especially
when hardware acceleration is possible.

And if we are running remotely:

 2 + network overhead, if RENDER supports the source format
 3 + network overhead, if RENDER doesn't support the source format 

By supporting the source format and adding a CompositeImage operation,
we could reduce this to 1 + network overhead. Whether the extra
copy matters at all compared to 100Mbit network speeds is
doubtful. However, the side benefit of a CompositeImage, making
sure that the source does _not_ end up having to be pulled back
from video memory, is probably significant.
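
To make the tally explicit, here is a small sketch that just restates
the counts above as code. count_copies is a hypothetical helper, and
network buffer copies are left out (they are the "network overhead").
The one case the text doesn't cover, CompositeImage with an unsupported
format, falls out as 2 here; that is my extrapolation.

```c
#include <stdbool.h>

/* Number of data copies made before compositing, not counting
 * network buffer copies. */
static int count_copies(bool remote, bool format_supported,
                        bool composite_image)
{
    if (!remote)
        return 1;    /* one copy into the ShmPixmap, converting as we go */

    int copies = 1;  /* source image -> PutImage (request) buffer */

    if (!format_supported)
        copies++;    /* separate client-side conversion pass */
    if (!composite_image)
        copies++;    /* server: request data -> pixmap/picture */

    return copies;
}
```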

So, improvements to RENDER I'd like to see:

 * Optimized software/hardware support for compositing (pix_format=xBGR,
   mask_format=Axxx), (pix_format=RGBx, mask_format=xxxA) with pix and mask
   pointing to the same buffer. Probably people with better
   initial choices of formats would like to see xRGB/Axxx
   added at the same time.

 * Some way of compositing source data from memory without risk
   of round-tripping it to the video card first, such as a 
   CompositeImage request.

Regards,
                                        Owen

P.S. - As evidence of the potential for improvement, at 16bpp, no
       hardware acceleration, local, the best speeds I've gotten 
       so far (XFree86-4.1 server) are:

       RENDER (premultiply data in client, use ShmPixmap to push
               to server and composite):

        - Dest in video memory 1.05 Mpix/s
        - Dest in system memory 5.75 Mpix/s
 
       ShmGetImage, convert dest to 24 bit, composite against dest, ShmPutImage:

        - Dest in video memory 2.71 Mpix/s
        - Dest in system memory 11.71 Mpix/s

       I could further speed up the GetImage/PutImage path by skipping
       the conversion to 24 bit, so I'd expect RENDER to at least
       be a _little_ faster :-). 
 
[1] The source data formats of interest for GdkRGB are
    _not_ premultiplied:

     - Server endian little: ABGR
     - Server endian big:    RGBA
 
    We can represent non-premultiplied data in the framework
    of RENDER by using the data both as source and mask, though
    that's not possible with XFree86-4.1 and seems to be
    unaccelerated/unoptimized with current CVS.

    It would be easy to optimize this in the software compositing
    case. In the hardware compositing case, knowledge of
    premultiplied formats would have to be added to XAA.
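
    To illustrate why source-plus-mask works for non-premultiplied
    data, here is a per-pixel sketch of what the operation computes:
    the color channels are read with alpha ignored (the xBGR view),
    the alpha byte is read as the mask (the Axxx view), and the result
    is color * A + dest * (1 - A). The function name is hypothetical,
    the layout (A in the top byte of a 32-bit word) is my reading of
    the formats above, and the destination is assumed opaque.

```c
#include <stdint.h>

/* Composite one non-premultiplied pixel over an opaque destination
 * pixel with the same channel layout, using the pixel as both
 * source (low 24 bits) and mask (high 8 bits). */
static uint32_t over_src_as_mask(uint32_t pix, uint32_t dest)
{
    uint32_t a = pix >> 24;                    /* mask view: Axxx */
    uint32_t out = 0;

    for (int shift = 0; shift < 24; shift += 8) {
        uint32_t s = (pix >> shift) & 0xFF;    /* source view: alpha ignored */
        uint32_t d = (dest >> shift) & 0xFF;
        uint32_t c = (s * a + d * (255u - a) + 127u) / 255u;

        out |= c << shift;
    }

    return out | 0xFF000000u;                  /* result stays opaque */
}
```

    An optimized software path would do this in one pass over the
    buffer rather than multiplying by the mask as a separate step.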

[2] Right now, the source must be in system memory to get decent
    performance: the software fallback clearly needs this, and
    XAA also only supports this.
    For the local rendering case, we can enforce the source
    being in the right place by using a ShmPixmap. For the
    remote case, we have no way to enforce this, and given
    sufficient video RAM, the source pixmap can end up in video
    memory and we'll get bad performance.

    A CompositeImage request would be useful to fix 
    this, and would give optimum performance in the remote
    case. Alternatively, if people want to be able to cache
    source data on the server, you'd need some way of hinting
    to the server: "put this pixmap in the optimum place for
    compositing source data".

   




_______________________________________________
Render mailing list
[EMAIL PROTECTED]
http://XFree86.Org/mailman/listinfo/render
