I've been playing around some with the simple problem of:
How to take 32 bit RGBA data on the client side and
composite it against a drawable as fast as possible
using RENDER.
One of the factors that influences the performance of this is
how many copies of the data must be made before we composite.
Some of the factors that influence that:
* Are we running remote or locally?
* Does RENDER support the source data format? [1]
* Where must the source be -- vid memory, system memory? [2]
* Does the hardware support the source data format? [1]
We'll ignore the last two in this discussion, since:
- Currently the source should always be in system memory
for XFree86 whether using software or hardware compositing (see [2])
- If hardware accelerating, we can convert to the hw format
when passing the source data to the video card. The only case
where hardware support for the data format would matter
(in terms of number of copies) would be if we could set up the
video card to composite with DMA directly from system memory.
On the client side, we need a copy:
- If RENDER doesn't support the source data format.
- to copy into a Shm buffer if we are running locally or
to copy into the PutImage buffer if running remotely.
(This copy can be combined with the conversion to a
RENDER-supported format in the Shm case. In the remote case, that
would be possible without changing the X protocol,
but it's probably easier to simply make sure we support
the source format.)
Being remote implies:
- A copy from the source image into the PutImage buffer.
- Network buffer copies on the source and destination sides
On the server side, we need a copy if:
- We aren't using ShmPixmaps, so we have to copy the data into
a pixmap/picture, then composite the pixmap/picture. (See
comment in [2] about "CompositeImage")
So, the total number of copies if we are running locally is:
1, whether or not RENDER supports the source data format -- we
need to make a copy into a ShmPixmap, and can convert as we
do that copy.
There's no improvement to be made here in number of copies; support
for source formats would still improve performance, especially
when hardware acceleration is possible.
And if we are running remotely:
2 + network overhead, if RENDER supports the source format
3 + network overhead, if RENDER doesn't support the source format
By supporting the source format and adding a CompositeImage operation,
we could reduce this to 1 + network overhead. Whether the extra
copy matters at all compared to 100 Mbit/s network speeds could be
doubted. However, the side benefit in a CompositeImage of making
sure that the source does _not_ end up having to be pulled back
from video memory is probably significant.
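As a rough sketch of the tally above (not anything that exists in RENDER or Xlib): the function name and its three flags are made up for illustration, network buffer copies are left out as "network overhead", and the case of a CompositeImage request with an unsupported source format is my extrapolation rather than something stated above.

```c
#include <stdbool.h>

/* Hypothetical helper: counts the data copies made before compositing,
 * following the tally in the text. Network buffer copies are not counted. */
int copies_before_composite(bool remote, bool format_supported,
                            bool have_composite_image)
{
    if (!remote) {
        /* Local: one copy into the ShmPixmap; any format conversion
         * is folded into that same copy. */
        return 1;
    }

    /* Remote: always one copy into the request (PutImage-style) buffer. */
    int copies = 1;

    /* A separate client-side conversion pass if the format is unsupported. */
    if (!format_supported)
        copies += 1;

    /* Without a CompositeImage request, the server copies the data into
     * a pixmap/picture before it can composite. */
    if (!have_composite_image)
        copies += 1;

    return copies;
}
```

With these assumptions, remote gives 2 or 3 today, and 1 with both a supported format and a CompositeImage request.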
So, improvements to RENDER I'd like to see:
* Optimized software/hardware support for compositing (pix_format=xBGR,
mask_format=Axxx), (pix_format=RGBx, mask_format=xxxA) with pix and mask
pointing to the same buffer. Probably people with better
initial choices of formats would like to see xRGB/Axxx
added at the same time.
* Some way of compositing source data from memory without risk
of round-tripping it to the video card first, such as a
CompositeImage request
Regards,
Owen
P.S. - As evidence of the potential for improvement, at 16bpp, no
hardware acceleration, local, the best speeds I've gotten
so far (XFree86-4.1 server) are:
RENDER (premultiply data in client, use ShmPixmap to push
to server and composite):
- Dest in video memory 1.05 Mpix/s
- Dest in system memory 5.75 Mpix/s
ShmGetImage, convert dest to 24 bit, composite against dest, ShmPutImage:
- Dest in video memory 2.71 Mpix/s
- Dest in system memory 11.71 Mpix/s
I could further speed up the GetImage/PutImage path by skipping
the conversion to 24 bit, so I'd expect RENDER to at least
be a _little_ faster :-).
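For reference, the per-pixel work in the GetImage/PutImage path looks roughly like the sketch below. It is not the code I timed: the function names are invented for illustration, it assumes a 5-6-5 layout for the 16bpp destination (5-5-5 visuals would need different shifts), and it leaves out the ShmGetImage/ShmPutImage calls themselves.

```c
#include <stdint.h>

/* Blend one 8-bit channel of non-premultiplied source over dest:
 * (src * alpha + dst * (255 - alpha)) / 255, rounded to nearest. */
static uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha)
{
    uint32_t t = (uint32_t)src * alpha + (uint32_t)dst * (255u - alpha) + 127u;
    return (uint8_t)(t / 255u);
}

/* Hypothetical helper: composite one non-premultiplied RGBA pixel over
 * one RGB565 destination pixel and return the new destination pixel. */
uint16_t composite_over_565(uint8_t r, uint8_t g, uint8_t b, uint8_t a,
                            uint16_t dst)
{
    /* Unpack the 5-6-5 fields. */
    uint8_t dr = (dst >> 11) & 0x1f;
    uint8_t dg = (dst >> 5) & 0x3f;
    uint8_t db = dst & 0x1f;

    /* Expand to 8 bits per channel by replicating the high bits. */
    uint8_t dr8 = (uint8_t)((dr << 3) | (dr >> 2));
    uint8_t dg8 = (uint8_t)((dg << 2) | (dg >> 4));
    uint8_t db8 = (uint8_t)((db << 3) | (db >> 2));

    uint8_t or8 = blend_channel(r, dr8, a);
    uint8_t og8 = blend_channel(g, dg8, a);
    uint8_t ob8 = blend_channel(b, db8, a);

    /* Repack to 5-6-5 by truncating the low bits. */
    return (uint16_t)(((or8 >> 3) << 11) | ((og8 >> 2) << 5) | (ob8 >> 3));
}
```

Skipping the expansion to 8 bits and blending directly in 5-6-5 is the shortcut mentioned above for speeding up this path.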
[1] The source data formats of interest for GdkRGB are
_not_ premultiplied:
- Server endian little: ABGR
- Server endian big: RGBA
We can represent not premultiplied data in the framework
of RENDER by using the data both as source and mask, though
that's not possible with XFree86-4.1 and seems to be
unaccelerated/unoptimized with current CVS.
It would be easy to optimize this in the software compositing
case. In the hardware compositing case, knowledge of
premultiplied formats would have to be added to XAA.
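To make the source-as-mask trick concrete, here is a small sketch of the per-channel arithmetic. It assumes bytes in memory order R, G, B, A, and all function names are made up for illustration; none of this is RENDER or XAA code. Treating the color channels as an opaque source and the alpha channel as the mask, (source IN mask) OVER dest gives the same result as premultiplying in the client and doing a plain OVER.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply two 8-bit values as fractions of 255, rounding to nearest. */
static uint8_t mul_div_255(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Premultiply a buffer of pixels in place; assumed byte order R, G, B, A. */
void premultiply_rgba(uint8_t *pixels, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; i++) {
        uint8_t *p = pixels + 4 * i;
        uint8_t a = p[3];
        p[0] = mul_div_255(p[0], a);
        p[1] = mul_div_255(p[1], a);
        p[2] = mul_div_255(p[2], a);
        /* Alpha itself is left unchanged. */
    }
}

/* OVER with an already premultiplied source channel. */
uint8_t over_premultiplied(uint8_t src_premul, uint8_t alpha, uint8_t dst)
{
    return (uint8_t)(src_premul + mul_div_255(dst, (uint8_t)(255 - alpha)));
}

/* (source IN mask) OVER dest, where the source channel is treated as
 * opaque and the mask carries the alpha of the same pixel. */
uint8_t over_src_in_mask(uint8_t src, uint8_t mask_alpha, uint8_t dst)
{
    uint8_t src_in_mask = mul_div_255(src, mask_alpha);
    return over_premultiplied(src_in_mask, mask_alpha, dst);
}
```

The two paths differ only in who does the multiply by alpha: the client in premultiply_rgba, or the compositing step in over_src_in_mask.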
[2] Right now, to get decent performance, the source must be
in system memory - we clearly need this for decent performance
for the software fallback, and XAA also only supports this.
For the local rendering case, we can enforce the source
being in the right place by using a ShmPixmap. For the
remote case, we have no way to enforce this, and given
sufficient video ram, we'll get bad performance.
A CompositeImage request would be useful to fix
this, and would give optimum performance in the remote
case. Alternatively, if people want to be able to cache
source data on the server, you'd need some way of hinting
to the server: "put this pixmap in the optimum place for
compositing source data".