On Fri, Apr 27, 2007 at 03:14:43AM +0300, ext Siarhei Siamashka wrote:
> On Tuesday 24 April 2007 12:36, Daniel Stone wrote:
> > Right, the branch is a problem, and as I said, the branch can be avoided
> > and the writes optimised to be three 32-bit writes for two macroblocks,
> > instead of two 32-bit writes and two 16-bit writes.
> 
> I did not have much free time to do complete tests, but initial benchmarks
> show that actually even removing this branch and using three 16-bit writes
> improves performance quite significantly. The test program is here:
> http://ufo2000.sourceforge.net/files/yuv420test.c
> 
> It produces the following results if compiled with optimization  
> options "-O3 -fomit-frame-pointer -mcpu=arm1136j-s":
> 
> # ./yuv420test
> test: 'yv12toyuv420_xomap', time=5.220, memory bandwidth=61.576MB/s
> test: 'yv12toyuv420_yv12toyuv420_branch_removed', time=3.503, memory 
> bandwidth=91.754MB/s
> 
> An interesting thing about this test is that it uses 2504 frames 400x240 
> each, that's the same number of frames as Nokia_N800.avi video has. 
> And mplayer spent 12,365s on video output when playing this video while 
> YV12->YUV420 conversion should have taken 5.220s as benchmarked in 
> this test.  So now color conversion is roughly half of the time spent on video
> output for this resolution. Some tests with higher resolution videos will be
> done later.

Good news!  Thanks for checking it out.

> As you see from the benchmark results, we can get 1.5x improvement 
> already for color conversion with just a trivial removal of a piece of
> redundant code. Was that branch in the code supposed to improve 
> performance? Seems like it resulted in quite the opposite effect.

You can't do unaligned writes to that area, so you need all 16-bit
writes or the branch.  The branch has always been a giant FIXME, but it
never got properly sorted thanks to the magic of deadlines, and a ton of
corner cases popping up at the last minute.  Sigh.

> I'll make a really optimized version of YV12 -> YUV420 convertor on this
> weekend (removing branch is good, but I feel that it can be improved 
> more) and will try to use it on Nokia 770, any extra video performance
> improvement will be useful there. I hope that the framebuffer driver on 
> Nokia 770 supports YUV420 color format properly.

I don't think Tornado supports YUV420, but I can check in the specs
tomorrow.  My better C version basically does two macroblocks at a time,
ensuring all 32-bit writes (which _really_ helps over 16-bit writes,
believe me).  This eliminates the branch, since your surface is
guaranteed to be word-aligned, so if you do all 32-bit writes, you can
just drop the branch as you know every write will be aligned.

This will be really fast.

> By the way, does anybody know if it is possible to enable tearsync support
> on Nokia 770 (by backporting some changes from N800 kernel or in some 
> other way)?

You can build the 770 kernel from the linux-omap tree, and support will
be there.

> > However, I don't think the lessons from the 770 are necessarily
> > _directly_ applicable to the N800: on the 770, our bottleneck is
> > decoding speed.  The bottleneck on the N800 is exactly the opposite:
> > video output.
> 
> I can't agree here. Memory speed is actually a lot faster on N800, the only
> trouble is graphics bus performance, but sending data to LCD controller
> through this bus does not introduce any load on ARM core and it can freely
> decode the next frame of video at the same time. At least this was the case
> with the previous version of firmware (I did not have enough time to see what
> was changed in framebuffer API and do any video tests with it).

Right.

> But color conversion is done by ARM core and it consumes precious cpu 
> cycles which could be used for decoding higher resolution/bitrate video.
> Optimizing color conversion will improve video performance. The 
> improvement will be most likely only within a few percents overall, but 
> every little bit helps.

Indeed.  However, once we remove stupid things like the branch which are
in the direct critical path of writes to the framebuffer, then I don't
think there's a hell of a lot left to gain.  But I'm more than happy to
be proven wrong. :)

> > Bear in mind that, unless you explicitly disable it (the Xv attribute is
> > something like XV_OMAP_VSYNC), the X server _will_ flush all pending
> > writes before the next frame is put through.  Else you get tearing,
> > because you can be halfway through an update, and writing the next frame
> > to the framebuffer, so which frame is being picked up, changes halfway
> > through.
> >
> > Try forcing XV_OMAP_VSYNC (or whatever it is) to 0, and comparing the
> > results.
> 
> OK, thanks, I'll try this test too and check if it affects Xv performance.
> But I thought that using 12bpp color format _and_ sending only as much 
> data as needed should solve the problem. Of course 800x480 * 16bpp * 30fps
> would be 23MB/s and it is too much. But for example 640x480 * 12bpp * 30fps =
> 12.3MB/s. Is the graphics bus fast enough to handle this?

I think we can do 12.3MB/s, yes.

> > before will bring us up to the point where any improvement that we can
> > make in that conversion will be eclipsed by the time taken to send it
> > over the bus, I believe.  But I can't prove that.
> 
> Well, I believe that every optimization which can provide a visible
> improvement (at least a few percents) is worth it. Optimizations are
> cumulative, a number of small 1-3% improvements added together result 
> in a significant performance boost.

Sure.  Unfortunately my job has other functions than to make video
decoding really, really fast, so I'm happy to merge, review, offer
feedback, and help you out where I can be useful, but I can't throw much
time at this myself.

> The opposite is also true, if one is lazy and adds inefficient code here and
> there, these small performance regressions accumulate and the program 
> starts to crawl :)

Indeed, I already fixed quite a few of these cases all through the X
server ...

> > > Well, it was just a comment for 'omapCopyPlanarDataYUV420' function 
> > > wrong and misleading,  nevermind :-) Now everything is clear.
> >
> > Hmm, is it?  Because, unless I was _really_ tired at the time I wrote it
> > (which is entirely possible), that's what the code does, and it works,
> > so ...
> 
> Yes, this part seems wrong to me (maybe it was an old stale comment?):
> /*
>  * Copy I420 data to the custom 'YUV420' format, which is actually:
>  * y11 u11,u12,u21,u22 u13,u14,u23,u24 y12 y14 y13
>  * y21 v11,v12,v21,v22 v13,v14,v23,v24 y22 y24 y23
>  * ...
>  */
> 
> I was pretty much confused until actually looked at the code. Shouldn't it be
> something like this:
> 
> /*
>  * ... 'YUV420' format, which is actually:
>  * | u/v1 y1 y2 | u/v2 y3 y4 | ...
>  * ('u/v' means 'u' for even lines and 'v' for odd lines)
>  * ...
>  */

No?

Aligned:
*d2++ = (*sy & 0x000000ff) | (*sc << 8) | ((*sy & 0x0000ff00) << 16);

Unaligned:
*d1++ = (*sy & 0x000000ff) | ((*sc & 0x00ff) << 8);
*d1++ = ((*sc & 0xff00) >> 8) | (*sy & 0x0000ff00);

(Luma, chroma, chroma, luma.)

Cheers,
Daniel

Attachment: signature.asc
Description: Digital signature

_______________________________________________
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers

Reply via email to