From: Siarhei Siamashka <[email protected]>

This patchset adds ARM NEON optimized fast paths for a number of compositing operations where the source image is scaled with the nearest filter. The operations to optimize were selected based on what the 'scaling-test' program currently supports, but any existing ARM NEON optimization can be reused and extended to support nearest scaling of the source image with relatively little effort.
The code in these patches builds on the main loop template from the header file 'pixman-fast-path.h', which was introduced earlier in:
http://lists.freedesktop.org/archives/pixman/2010-September/000547.html

This template still has a few shortcomings. The NONE repeat still causes problems for some operations. REFLECT repeat support should also be added so that the NEAREST_FAST_PATH macro in 'pixman-fast-path.c' can finally be fully replaced and retired. Moreover, it looks like support for scaled fast paths with an a8 mask is also needed for some practical use cases. I'm experimenting with generalizing and further improving the scaled fast path template in the following branch (it's still not fully ready yet):
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=a8-mask-scaling-wip

Regarding the performance of this code: the pixel fetcher supporting nearest scaling for the NEON fast paths is still not fully tuned. Things left to try:

* experiment with prefetch instructions to see if they can improve performance (they seem to be useless for the scaled source, but the prefetch effect for the mask and destination can still be investigated)

* deinterleaving of color components is currently expensive for scaled source images, so selectively using it for the destination and mask while disabling it for the source may be tried

* alternative methods of fetching scaled pixels may be tried:
  a) use VLD1 instructions to load pixel data into NEON registers (as is done now)
  b) use LDR and ORR instructions on the ARM side to assemble the pixel data in ARM registers, then move it to the NEON side with VMOV (more work for the ARM pipeline, less work for NEON)
  c) use a single SMLATB instruction instead of a MOV + ADD pair (SMLATB is slow, takes 2 cycles and has much higher latency, but may reduce pressure on the instruction decoder if that is the bottleneck)
  d) split the scaled pixel fetcher into parts in order to better interleave ARM and NEON instructions and avoid possible bubbles in the NEON pipeline

* in all the cases, instruction scheduling might still be improved further

As for performance, the patches definitely improve it. Still, in some cases the gain from these new fast paths over doing scaling and compositing as separate steps is not that large. One case that works somewhat better than the others is src_0565_8888 (all the performance numbers are in the commit messages). It achieves 76.98 MPix/s, while separate scaling runs at 94.91 MPix/s and separate 565->8888 conversion at 137.78 MPix/s. So the performance of doing scaling and then compositing separately can be estimated as:

  1 / (1 / 94.91 + 1 / 137.78) = 56.2 MPix/s

These numbers are for a scaling factor very close to 1x. The estimate also does not take data locality into account (the intermediate temporary data could stay in the L1 cache). However, my older memory performance experiments on OMAP3 chips (ARM Cortex-A8 CPU) showed that, contrary to expectations, copying in two separate steps ("source buffer -> L1 cache", then "L1 cache -> destination buffer") is about twice as slow as a direct "source buffer -> destination buffer" copy. I have some theories about why this happens. One is that the best total memory throughput may be achieved when memory is read and written at the same time (a kind of full duplex). Another is that data in the intermediate buffer may be evicted from the L1 cache too often (due to the random replacement policy). These are just speculations (they may have nothing in common with the real cause), but the end result is the same: the default "fetch" -> "combine" -> "store" pixman pipeline is slow on this type of hardware. So naturally, when implementing scaled fast paths, I'm also currently favouring single pass processing for everything. Hopefully the performance can still be improved further by tweaking the scaled pixel fetcher.
But of course, it might be that scaling (scattered small memory accesses) adds more pressure on the load-store unit and changes the whole picture significantly. Let's see.

These patches are also available here:
http://cgit.freedesktop.org/~siamashka/pixman/log/?h=sent/neon-nearest-scaling-20101103

Siarhei Siamashka (10):
  ARM: fix 'vld1.8'->'vld1.32' typo in add_8888_8888 NEON fast path
  ARM: NEON: source image pixel fetcher can be overrided now
  ARM: nearest scaling support for NEON scanline compositing functions
  ARM: macro template in C code to simplify using scaled fast paths
  ARM: performance tuning of NEON nearest scaled pixel fetcher
  ARM: NEON optimization for scaled over_8888_8888 with nearest filter
  ARM: NEON optimization for scaled over_8888_0565 with nearest filter
  ARM: NEON optimization for scaled src_8888_0565 with nearest filter
  ARM: NEON optimization for scaled src_0565_8888 with nearest filter
  ARM: optimization for scaled src_0565_0565 with nearest filter

 pixman/pixman-arm-common.h   |   40 ++++++++
 pixman/pixman-arm-neon-asm.S |  100 ++++++++++++++-----
 pixman/pixman-arm-neon-asm.h |  216 +++++++++++++++++++++++++++++++++++++++---
 pixman/pixman-arm-neon.c     |   30 ++++++
 pixman/pixman-arm-simd-asm.S |   70 ++++++++++++++
 pixman/pixman-arm-simd.c     |    7 ++
 6 files changed, 423 insertions(+), 40 deletions(-)

--
1.7.2.2

_______________________________________________
Pixman mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/pixman
