Hello pixman mailing list. This is my first post, so I hope I'm following the list etiquette OK.
I have been working on improving pixman's performance on ARMv6/ARM11. Specifically, I'm targeting the Raspberry Pi, which uses a BCM2835 SoC, from the BCM2708 family. This uses an ARM1176JZF-S core, running at 700 MHz.

General features of the ARM1176JZF-S are a 4-way set-associative L1 data cache with a cache line length of 8 words (256 bits) and a configurable size between 4KB and 64KB. The BCM2835 uses an L1 data cache size of 16KB, but also adds a Broadcom proprietary L2 cache of 128KB with cache lines of 16 words (512 bits), with flags to allow a cache line to be half valid. Empirical tests show that despite this, the write buffer operates at peak efficiency for 4-word aligned writes of 4 words. Even cacheline-aligned writes of 8 words are slower by over 30%; clearly there's some sort of shortcut happening in the 4-word case, because doing complete cache line fills takes several times longer per byte than 4-word writes do. The Raspberry Pi bootloader has an option to disable write-allocate for the L2 cache (disable_l2cache_witealloc [sic]), and I didn't find it had any noticeable effect on 4-word write speeds, which lends further credibility to this. (In fact, if anything, timings were slightly worse when write-allocate was disabled, due to an increased fraction of test runs falling into a secondary cluster with a longer test runtime.)

I also saw no measurable difference between timings for the VFP register file compared to the main ARM register file: again, the optimum size was four 32-bit registers (or two 64-bit registers). Although the use of the VFP would ease register pressure on the ARM register file, in every case where we're actually short of registers, we also want to do some integer manipulations of the pixel data, so it's not of any benefit to use the VFP. It would also limit the usefulness of this implementation to ARM11s that have VFP fitted, so I have not pursued this avenue further.
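To make the 4-word write pattern described above a little more concrete, here is a minimal C sketch of a solid fill that issues its bulk stores in 16-byte aligned groups of four words. It is an illustration only, not code from the patch (which is written as assembly macros): the name fill32_quadword and its block-count return value are hypothetical, and a C compiler is free to lower the four stores however it likes, whereas hand-written assembly would typically use a single store-multiple of four registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill n 32-bit pixels with value, grouping the bulk of
 * the work into 16-byte aligned blocks of 4 words. Returns the number of
 * 4-word blocks written, purely so the behaviour can be inspected. */
static size_t fill32_quadword(uint32_t *dst, uint32_t value, size_t n)
{
    size_t blocks = 0;

    assert(dst != NULL || n == 0);

    /* Leading pixels: single-word stores until dst is 16-byte aligned. */
    while (n > 0 && ((uintptr_t)dst & 15u) != 0) {
        *dst++ = value;
        n--;
    }

    /* Bulk: one aligned group of 4 words per iteration. */
    while (n >= 4) {
        dst[0] = value;
        dst[1] = value;
        dst[2] = value;
        dst[3] = value;
        dst += 4;
        n -= 4;
        blocks++;
    }

    /* Trailing pixels: fewer than 4 words remain. */
    while (n > 0) {
        *dst++ = value;
        n--;
    }

    return blocks;
}
```

The point of the split into leading, bulk and trailing phases is simply that every bulk store starts on a 4-word boundary, which is the case the timings above favour.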
Additional testing of prefetching has identified marked differences in timings for different address patterns. In particular, there is a 50% speed penalty if the address is not in the first 2 words of each 8 words: this has been tracked down to a fault in critical-word-first handling in the BCM2835 L2 cache. An even more extreme effect was observed if consecutive prefetches referenced the same address: this doubled the runtime (although I don't know if this is BCM2835 specific or not). Consequently, I have devised a prefetch scheme that is careful to prefetch only the address of the start of each cache line, and to do so only once per cache line.

I am aware that some may question the targeting of BCM2835 specific cache behaviours in what is supposed to be a generic ARM11 implementation. However, the cache line size is fixed at 8 words across ARM1136, ARM1156 and ARM1176, so this approach will not lead to any cache lines being omitted from prefetch on any ARM11. The overhead of branching over an unwanted PLD instruction, which would have completed in a trivial amount of time on an ARM11 without the BCM2835's bugs, should be minimal, so I think it's valid to propose this patch for all ARMv6 chips.

My new ARMv6 fast paths are assembled using a hierarchy of assembly macros, in a method inspired by Siarhei's ARM NEON fast paths, although obviously the details are somewhat different. The majority of my time so far has been spent on optimising the memory reads and writes, since these dominate all but the more complex pixel processing steps. I've only converted a handful of the most common operations into macro form so far: for the most part these correspond to blits and fills, plus the routines which had previously been included in pixman-arm-simd-asm.S as disassembled versions of C functions using inline assembler.
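As a rough illustration of the prefetch rule described earlier (prefetch only the start of each cache line, and only once per line), here is a minimal C sketch. It is not the code from the patch: prefetch_span and its last_line bookkeeping are hypothetical names, the 32-byte line size is the 8-word ARM11 figure quoted above, and __builtin_prefetch is the GCC/Clang builtin, which I am assuming is what stands in for PLD when building for ARM. The duplicate check only remembers the most recently prefetched line, which is enough for a forward sequential walk but not for arbitrary access patterns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u /* 8 words, the ARM11 line size quoted above */

/* Hypothetical helper: prefetch every cache line touched by the byte range
 * [start, start + len), always at the line's first byte, and skip a line
 * that was the most recent one prefetched. *last_line carries that state
 * between calls; initialise it to UINTPTR_MAX (never a line-aligned value).
 * Returns the number of prefetches issued. */
static size_t prefetch_span(const void *start, size_t len, uintptr_t *last_line)
{
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE_BYTES - 1);
    uintptr_t first, last, line;
    size_t issued = 0;

    assert(last_line != NULL);
    if (len == 0)
        return 0;

    first = (uintptr_t)start & mask;            /* line holding the first byte */
    last = ((uintptr_t)start + len - 1) & mask; /* line holding the last byte */

    for (line = first; ; line += CACHE_LINE_BYTES) {
        if (line != *last_line) {               /* the "branch over" an unwanted PLD */
            __builtin_prefetch((const void *)line);
            *last_line = line;
            issued++;
        }
        if (line == last)
            break;
    }
    return issued;
}
```

A caller walking a scanline would keep one last_line variable per stream (source, mask, destination) and call the helper for each chunk it is about to process, so that chunks sharing a cache line cost one comparison and a branch rather than a repeated prefetch.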
However, I'm pleased to report that even in the L1 test where memory overheads are not an issue, these operations are seeing some improvements from processing more than one pixel at once, and by the use of the SEL instruction.

One minor change in functionality that I should note is that previously the top level function pixman_blt() was a no-op on ARMv6, because neither the armv6 nor the generic C fast path sources filled in the "blt" field in their pixman_implementation_t structure. I've fixed this, for what it's worth (though I'm assuming that not much software can have been using it if it escaped notice before).

I'm distributing the code as it is now to give people a chance to play with it over the Christmas break. Some of the things I intend to tackle in the new year are:

* no thought has been put into 24bpp formats yet
* there is no support for scaled plots, either nearest-neighbour or bilinear-interpolation as yet; in fact the two scaled blit routines are the sole survivors of the old pixman-arm-simd-asm.S at present
* the number of fast path operations is small compared to other implementations; I'm targeting eventual parity with the number of operations in the NEON case

To give you some idea of the improvements represented by this patch, please see the numbers below. These represent samples of 100 runs of lowlevel-blt-bench, and are comparing the head revision from git against the same with these patches applied. It seems that lowlevel-blt-bench is not very good at measuring the fastest operations, as a large proportional random error creeps in - I'm guessing it's to do with the way it tries to cancel out the function call overhead. All the results except those marked with (*) pass a statistical significance test (Student's independent two-sample t-test).

I hope you'll agree that these are good results; the only fly in the ointment is the L1 test results for the three blit routines.
In these cases, I'm competing against C fast path implementations that use memcpy() to do the blit, where memcpy() is already somewhat hand-tuned (although the tuning it does is obviously much less suited to memory-bound operations than the one I present here).

                            Old              New     Improvement
                         Mean  StdDev     Mean  StdDev       (%)
src_n_8888        L1    157.1    11.7    590.8   122.0     276.1
                  L2     77.7    21.0    336.7    44.9     333.6
                  M      37.8     0.1    320.5     4.4     747.6
                  HT     32.1     0.3     75.5     1.9     135.2
                  VT     29.1     0.3     61.4     1.9     111.4
                  R      28.1     0.3     53.1     0.9      89.0
                  RT     14.4     0.5     17.3     0.7      20.2

src_n_0565        L1    153.9     8.2   1372.4  3404.4     791.9
                  L2    109.4     9.0    680.4    27.4     521.7
                  M      57.4     0.2    564.4    11.0     883.5
                  HT     44.4     0.7     84.5     1.8      90.2
                  VT     38.9     0.5     67.6     3.5      73.7
                  R      36.9     0.5     59.5     2.7      61.3
                  RT     16.2     0.6     18.2     1.1      12.8

src_n_8           L1    155.3    12.2   1569.0  2658.0     910.3
                  L2    107.4     3.7   1098.6    58.3     922.8
                  M      76.2     0.3    981.7    25.6    1188.9
                  HT     54.8     1.2     95.5     2.8      74.2
                  VT     46.6     0.6     74.8     1.7      60.5
                  R      43.5     0.7     66.8     1.3      53.6
                  RT     17.4     0.8     19.4     0.7      11.4

src_8888_8888     L1    452.2   385.2    352.1    42.8     -22.1 (*)
                  L2     58.4     3.8     73.9     4.1      26.6
                  M      52.3     0.2     71.4     0.3      36.4
                  HT     25.1     0.2     31.1     0.4      23.6
                  VT     22.2     0.2     28.5     0.3      28.1
                  R      17.6     0.2     25.7     0.3      46.4
                  RT      6.5     0.2      9.4     0.2      44.8

src_0565_0565     L1    404.0    43.7    316.0    46.4     -21.8
                  L2     79.2     3.4     93.4     2.5      17.9
                  M      76.3     1.3    115.1     0.8      50.9
                  HT     33.5     0.4     39.8     0.8      18.6
                  VT     28.7     0.4     36.0     0.7      25.1
                  R      21.2     0.2     30.9     0.4      45.6
                  RT      6.7     0.1      9.6     0.3      43.3

src_8_8           L1    711.9   172.9    488.2  1167.6     -31.4 (*)
                  L2    140.0     3.9    189.1    10.8      35.1
                  M     128.3     3.5    203.8    14.3      58.8
                  HT     38.8     0.6     46.2     2.3      19.0
                  VT     30.5     1.2     41.0     2.2      34.4
                  R      24.5     0.3     34.5     1.7      40.7
                  RT      7.3     0.2      9.8     0.6      33.9

src_x888_8888     L1     95.1     3.4    271.3    22.2     185.3
                  L2     26.3     1.4     72.3     5.4     174.7
                  M      20.2     0.1     70.9     1.7     250.8
                  HT     15.1     0.1     30.8     0.4     103.3
                  VT     14.4     0.1     28.2     0.3      95.5
                  R      14.4     0.1     25.6     0.3      77.3
                  RT      7.5     0.2      9.4     0.3      25.0

src_0565_8888     L1     37.1     1.7     66.9     1.9      80.3
                  L2     25.9     0.5     53.1     0.6     105.0
                  M      24.2     0.1     60.9     0.2     152.2
                  HT     14.2     0.1     28.4     0.3      99.1
                  VT     14.0     0.1     26.0     0.3      85.5
                  R      13.0     0.1     23.6     0.2      81.2
                  RT      5.5     0.1      9.1     0.2      65.3

add_8_8           L1     61.7     3.8    786.3  2557.9    1173.4
                  L2     38.0     0.5    112.3     3.6     196.0
                  M      39.1     0.1    106.4     0.8     172.2
                  HT     29.5     1.2     34.4     0.6      16.6
                  VT     29.3     0.3     33.0     0.6      12.7
                  R      20.6     0.2     27.3     0.9      32.8
                  RT      8.2     0.2      8.7     0.2       5.4

over_8888_8888    L1     32.4     0.8     38.0     0.8      17.3
                  L2     14.9     0.3     29.9     0.7     100.6
                  M      12.8     0.0     24.7     0.3      93.3
                  HT     10.1     0.2     14.0     0.1      37.8
                  VT     10.0     0.1     13.3     0.1      33.0
                  R       9.9     0.0     13.9     0.1      39.9
                  RT      5.8     0.1      6.4     0.2      10.0

over_8888_n_8888  L1     17.8     0.2     21.1     0.4      18.7
                  L2     11.1     0.2     19.1     0.1      71.2
                  M       9.9     0.0     19.1     0.0      93.0
                  HT      8.2     0.0     11.2     0.1      36.5
                  VT      8.1     0.0     10.6     0.1      31.9
                  R       8.1     0.0     10.9     0.1      35.4
                  RT      5.0     0.1      5.5     0.1      10.1

over_n_8_8888     L1     17.7     0.2     23.1     0.5      30.5
                  L2     13.8     0.4     22.1     0.1      60.4
                  M      11.8     0.1     22.4     0.0      90.6
                  HT     10.3     0.3     12.1     0.1      18.1
                  VT      9.8     0.2     11.5     0.1      17.5
                  R       9.2     0.0     10.7     0.1      16.6
                  RT      5.3     0.1      5.8     0.1       7.6

Regards,
Ben Avison