Hi,

On Mon, Apr 30, 2007 at 02:27:49PM +0300, ext Siarhei Siamashka wrote:
> On Friday 27 April 2007 04:43, Daniel Stone wrote:
> > I don't think Tornado supports YUV420, but I can check in the specs
> > tomorrow.  My better C version basically does two macroblocks at a time,
> > ensuring all 32-bit writes (which _really_ helps over 16-bit writes,
> > believe me).  This eliminates the branch, since your surface is
> > guaranteed to be word-aligned, so if you do all 32-bit writes, you can
> > just drop the branch as you know every write will be aligned.
> >
> > This will be really fast.
> 
> Optimized YV12 -> YUV420 convertor is done. The sources can be found here:
> https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libswscale_nokia770/?root=mplayer
> 
> Take a look at 'arm_colorconv.h' and 'arm_colorconv.S' files. Also there is a
> test program ('test_colorconv') which can ensure that everything works
> correctly and fast:
> 
> ~ $ ./test_colorconv
> [results follow]
> 
> ARMv6 optimized YV12->YUV420 convertor is about 2.5x faster
> than current code used in N800 xserver. So it should provide a nice
> improvement for video :)

Indeed.  Unfortunately this is slightly misleading in that it only shows
the raw write speed.  RFBI, for example, can't keep up with the sorts of
speeds that your hyper-optimised version is pumping out.  So it's mainly
just about cutting the latency in the critical path to the point where it
makes no difference.

> I doubt that your better C version can beat it or even get any close.

Of course not.

> There are two important optimizations in this code:
> 1. Cache prefetch with PLD instruction (added in '_armv5' version) which
> boosts performance to 70 megapixels per second. Inner loop is unrolled
> to process 32 pixels per iteration (cache line size is 32 bytes on ARM, so
> such unrolling is convenient). This is the most important improvement.
> You can try using __builtin_prefetch() from C code to do the same
> optimization.

Ah, sounds useful.  From what Dan Amelang's been saying on xorg@, gcc
should coalesce four 32-bit reads into one 128-bit read, but this sounds
promising as well.
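
For anyone wanting to try the __builtin_prefetch() suggestion from C, here
is a minimal sketch.  It assumes GCC or a compatible compiler and the
32-byte cache line mentioned above; copy_with_prefetch is a hypothetical
helper name, not anything from the xserver or from arm_colorconv.S, and the
prefetch distance of one line ahead is a guess that would need measuring.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy `words` 32-bit words from src to dst.
 * Each main-loop iteration handles 8 words (32 bytes, one assumed
 * cache line) and asks for the next line to be fetched early. */
static void copy_with_prefetch(uint32_t *dst, const uint32_t *src,
                               size_t words)
{
    size_t i = 0;

    for (; i + 8 <= words; i += 8) {
        /* Read prefetch, low temporal locality: data is used once. */
        __builtin_prefetch(src + i + 8, 0, 0);

        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
        dst[i + 4] = src[i + 4];
        dst[i + 5] = src[i + 5];
        dst[i + 6] = src[i + 6];
        dst[i + 7] = src[i + 7];
    }

    /* Tail: fewer than 8 words left. */
    for (; i < words; i++)
        dst[i] = src[i];
}
```

The prefetch is only a hint, so the result is the same with or without it;
whether it helps depends on the core and the memory behind it.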

> 2. The use of ARMv6 instruction REV16 to do byte swapping for high and low
> 16-bit register parts. This optimization was added in the '_armv6' version
> and boosted performance even more, to 85 megapixels per second. This
> optimization is highly unlikely, probably impossible, for a C version.

Sounds useful.
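
For reference, the effect of REV16 (swap the two bytes inside each 16-bit
half of a 32-bit word) can be written in portable C as below.  This is only
a sketch of the operation: swap_bytes_in_halfwords is a hypothetical name,
and I would not assume a given compiler reduces it to a single REV16, so it
says nothing about matching the assembly version's speed.

```c
#include <stdint.h>

/* Hypothetical helper: swap the bytes within each 16-bit half of x,
 * i.e. bytes [3 2 1 0] become [2 3 0 1]. */
static inline uint32_t swap_bytes_in_halfwords(uint32_t x)
{
    return ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
}
```

For example, 0x11223344 becomes 0x22114433.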

> I was a bit wrong about YUV420 format in my previous post.
> 
> Suppose we have planar YV12 image with the following data.
> Y plane: Y1 Y2 Y3 Y4 ...
> U plane: U1 __ U2 __ ...
> 
> Normal YUV420 (according to pictures in Epson docs)  would be the following:
> U1 Y1 Y2 U2 Y3 Y4 ...
> 
> But it appears (most likely because of the 16-bit interface and some endian
> differences between ARM and the Epson chip) that each pair of bytes is
> swapped and we actually get the following somewhat weird layout:
> Y1 U1 U2 Y2 Y4 Y3 ...

Right, hence the comment in the code is correct. ;)
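
To spell the swapped layout out, here is a minimal sketch that packs one
row in the order quoted above (Y1 U1 U2 Y2 Y4 Y3).  It is not the code in
the server or in arm_colorconv.S: pack_row_swapped is a hypothetical name,
the width is assumed to be a multiple of 4, and which chroma plane feeds a
given row is not covered here, so the caller just passes one chroma row.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack one row of `width` luma samples plus
 * width/2 chroma samples into 6 bytes per 4 pixels.  The unswapped
 * order would be C1 Y1 Y2 C2 Y3 Y4; each byte pair is swapped, giving
 * Y1 C1 C2 Y2 Y4 Y3.  Assumes width is a multiple of 4. */
static void pack_row_swapped(uint8_t *dst, const uint8_t *y,
                             const uint8_t *c, size_t width)
{
    for (size_t x = 0; x + 4 <= width; x += 4) {
        dst[0] = y[x + 0];
        dst[1] = c[x / 2];
        dst[2] = c[x / 2 + 1];
        dst[3] = y[x + 1];
        dst[4] = y[x + 3];
        dst[5] = y[x + 2];
        dst += 6;
    }
}
```

With luma 1 2 3 4 and chroma 101 102 as input, the output bytes are
1 101 102 2 4 3.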

> As for the other possible Xv optimizations. You mentioned that fallback code
> is not important at all. But imagine 640x480 video playback in windowed 
> mode. Decoding it will require quite a lot of resources, but additionally
> scaling it down using a slow fallback code will be a finishing blow. In
> addition, a solution (fast JIT accelerated YV12->YUY2 scaler) for this 
> problem already exists. I can also modify this scaler to support
> YV12->YUV420 scaling. An interesting thing here is that this scaler
> could be also used by xserver to solve graphics bus bandwidth 
> issues. Imagine that we have some high resolution video with high 
> framerate which exceeds graphics bus capabilities. In this case
> this video can be downscaled in software using JIT scaler to lower 
> resolution before sending data to LCD controller. What do you think?

IMO this is a policy issue, and X is 'mechanism, not policy'.  If you
want to adapt the scaler, I'm more than happy to include it, but I'm not
about to start doing automatic scaling.

IOW, 'ask a stupid question, get a stupid answer'.

> That's fine. Now I'm waiting for further instructions :) Should I try to
> prepare a complete patch for xserver? I'm really interested in getting
> this optimization into xserver as it would help to play high resolution
> videos. If you have any extra questions about the code or anything 
> else (for example I wonder what free license would be appropriate
> for it), don't hesitate to contact me.

If you wanted to prepare a complete patch for the server, that would be
great, as I don't have time to get to it right now (trying to finish off
the merge with upstream, among others).  As for the license, just the
standard MIT boilerplate in hw/kdrive/omap/* is fine, but replace Nokia
Corporation/Daniel Stone with Siarhei Siamashka, obviously.

> I did not try to build xserver sources yet as I did not have enough time 
> for that and xserver requires quite a number of build dependencies. Can
> you share some tips and tricks about maemo xserver development? Is it
> difficult to compile (do I need any extra build scripts, tools, or
> configuration options) and install on N800 (is it safe to upgrade 
> xserver on N800 from .deb file)?

It's completely safe to upgrade from a deb if it's not broken.  If you
set up a standard Maemo build environment and run apt-get source
xorg-server and apt-get build-dep xorg-server, it should work just fine,
in theory.

I don't have any tips, per se.  Once I get it all integrated it'll be in
git, but for now, the only public source is the packages.

> I also tried to use YUV420 on Nokia 770, but it did not work well. According
> to Epson, this format should be supported by hardware. Also there is a
> constant OMAPFB_COLOR_YUV420 defined in omapfb.h in Nokia 770 kernel 
> sources. But actually using YUV420 was not very successful. Full screen update
> 800x480 in YUV420 seems to deadlock Nokia 770. Playback of centered 
> 640x480 video in YUV420 format was a bit better, at least I could decipher
> what's on the screen. But anyway, it looked like an old broken TV :) Image was
> not fixed but floating up and down, there were mirrors, tearings, some color
> distortion, etc. After video playback finished, the screen remained in
> inconsistent state with a striped garbage displayed on it. Starting video
> playback with YUY2 output fixed it. But anyway, looks like YUV420 is not
> supported properly in the framebuffer driver from the latest OS2006 kernel. 
> That's bad, it could provide ~30% improvement in video output performance
> for Nokia 770. Maybe upgrading framebuffer driver can fix this issue (and add
> tearsync support).

SoSSI is relatively quick, so you won't see much of a bandwidth win from
using YUV420 over YUV422.  Aside from that, I don't know, though.
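
As a rough check on the numbers, the per-frame data volume can be worked
out as below.  This assumes 16 bits per pixel for YUV422 (YUY2) and 12 bits
per pixel for YUV420 (6 bytes per 4 pixels, as in the layout quoted
earlier); frame_bytes is a hypothetical helper, and the figures say nothing
about how fast the bus actually is.

```c
#include <stddef.h>

/* Hypothetical helper: bytes needed for one frame at a given
 * number of bits per pixel. */
static size_t frame_bytes(size_t width, size_t height,
                          size_t bits_per_pixel)
{
    return width * height * bits_per_pixel / 8;
}
```

For a full 800x480 update that is 768000 bytes in YUV422 against 576000
bytes in YUV420, i.e. 25% less data per frame.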

Thanks again for working on this; glad to see someone cares enough to
help sort it out. :)

Cheers,
Daniel

_______________________________________________
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers
