On Tuesday 16 March 2010, Alexander Larsson wrote:
> Further simplifying by removing support for unit_x < 0 (i.e. mirrored
> scaling) gives:
>
> == nearest tiled SRC ==
> op=1, src_fmt=20028888, dst_fmt=20028888, speed=655.11 MPix/s (156.19 FPS)
> op=1, src_fmt=20028888, dst_fmt=10020565, speed=136.02 MPix/s (32.43 FPS)
> op=1, src_fmt=10020565, dst_fmt=10020565, speed=619.16 MPix/s (147.62 FPS)

Just to make the use case clear: NORMAL repeat is important for the
browser (zooming tiled backgrounds). The image quality of NEAREST
scaling is quite bad, but if the CPU is slow, it becomes a poor man's
choice.

If the code replicates a very small tile, almost no cache misses are
expected when accessing the source image, and the performance should be
comparable to a solid fill if memory bandwidth is the limiting factor.
But the CPU is obviously the bottleneck here for both Intel Core2 and
ARM Cortex-A8. Writes to memory are nonblocking from the CPU's point of
view and go through the write combining buffer. As long as the CPU
can't process data fast enough (and this is the case for the 'nearest
tiled SRC' test), memory writes are mostly transparent and don't affect
the performance.

For ordinary scaling on Intel Core2 (x86-64), I get an amazing result:
the MPix/s rating is the same for both nonscaled and scaled blits using
the a8r8g8b8 color format. But for the r5g6b5 format, the nearest
scaler falls a bit behind. Trying 8bpp would increase this gap even
more. So the nearest scaler can be memory bandwidth limited if hardware
prefetch works efficiently and the CPU is much faster than memory. For
lower bpp color formats, optimizing CPU usage still becomes important.

For ARM Cortex-A8, scaling is always slower than a simple blit, but it
may need tweaking and adding explicit prefetch. Also, the NEON unit can
access memory much faster, so it would be practically impossible for
scaling to reach the same performance as a simple blit.

> It might be interesting to duplicate the inner loop, once for each
> unit_x sign to get this performance increase always.

Duplicating the inner loop is easy using the trick with an
always_inline function, described here:
http://lists.freedesktop.org/archives/pixman/2010-March/000111.html

But if fast scaling with REFLECT repeat is wanted (the kind which
checks only the lower or the upper boundary per step, not both), it has
to be implemented with something like a DFA.
The function body would then be duplicated for the cases walking
forward (unit_x >= 0) and walking backwards (unit_x < 0). When a source
image boundary is crossed, unit_x can be negated, vx updated, and a
switch to a different state (the block of code walking in the opposite
direction) performed with a goto. This makes the code a bit more
complex, but ensures the best performance. Because walking happens in
both the horizontal and vertical directions, a total of 4 slightly
different blocks of code may be required. Alternatively, vertical
walking can simply always check both boundaries, as it is less
important for performance.

-- 
Best regards,
Siarhei Siamashka
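
P.S. To illustrate the two-state idea for the horizontal direction, here
is a minimal sketch. It is not pixman code: the function name, the
fixed_16_16_t typedef and the preconditions checked by assert() are
assumptions made for this example only. It assumes 16.16 fixed-point
coordinates, a start position inside the source scanline, and a step no
larger than the source width, so that one reflection per step is enough.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_16_16_t;   /* hypothetical name: 16.16 fixed point */

/*
 * Fill n destination pixels from one source scanline using NEAREST
 * sampling and REFLECT repeat. Each state checks only one boundary.
 * Assumptions of this sketch (not taken from pixman):
 *   - 0 <= vx < (src_width << 16) on entry
 *   - |unit_x| <= (src_width << 16)
 *   - (2 * src_width) << 16 fits in int32_t
 */
static void
scanline_nearest_reflect (uint32_t *dst, int n,
                          const uint32_t *src, int src_width,
                          fixed_16_16_t vx, fixed_16_16_t unit_x)
{
    const fixed_16_16_t max_vx = (fixed_16_16_t) src_width << 16;

    assert (src_width > 0 && src_width < (1 << 14));
    assert (vx >= 0 && vx < max_vx);
    assert (unit_x <= max_vx && unit_x >= -max_vx);

    if (n <= 0)
        return;
    if (unit_x < 0)
        goto backward;

forward:
    /* State 1: unit_x >= 0, only the upper boundary can be crossed. */
    for (;;)
    {
        *dst++ = src[vx >> 16];
        if (--n == 0)
            return;
        vx += unit_x;
        if (vx >= max_vx)
        {
            vx = 2 * max_vx - 1 - vx;   /* mirror around the right edge */
            unit_x = -unit_x;
            goto backward;
        }
    }

backward:
    /* State 2: unit_x < 0, only the lower boundary can be crossed. */
    for (;;)
    {
        *dst++ = src[vx >> 16];
        if (--n == 0)
            return;
        vx += unit_x;
        if (vx < 0)
        {
            vx = -1 - vx;               /* mirror around the left edge */
            unit_x = -unit_x;
            goto forward;
        }
    }
}
```

With a 3-pixel source "a b c", vx = 0 and unit_x = 1 << 16, the sketch
walks a b c, reflects at the right edge, walks c b a, reflects at the
left edge, and continues with a b. A vertical counterpart would either
need two more states or could simply check both boundaries per
scanline.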
