[Pixman] [PATCH 0/5] ARMv6: New fast path implementations that utilise prefetch

Ben Avison Fri, 21 Dec 2012 10:47:57 -0800

Hello pixman mailing list. This is my first post, so I hope I'm
following the list etiquette OK.


I have been working on improving pixman's performance on ARMv6/ARM11.
Specifically, I'm targeting the Raspberry Pi, which uses a BCM2835
SoC, from the BCM2708 family. This uses an ARM1176JZF-S core, running
at 700 MHz.

General features of the ARM1176JZF-S are a 4-way set-associative L1
data cache with a cache line length of 8 words (256 bits) and a
configurable size between 4KB and 64KB. The BCM2835 uses an L1 data
cache size of 16KB, but also adds a Broadcom proprietary L2 cache of
128KB with cache lines of 16 words (512 bits), with flags to allow a
cache line to be half valid.

Empirical tests show that despite this, the write buffer operates at
peak efficiency for 4-word aligned writes of 4 words. Even cacheline-
aligned writes of 8 words are slower by over 30%; clearly there's some
sort of shortcut happening in the 4-word case, because doing complete
cache line fills takes several times longer per byte than 4-word writes
do. The Raspberry Pi bootloader has an option to disable write-
allocate for the L2 cache (disable_l2cache_witealloc [sic]), and I
didn't find it had any noticeable effect on 4-word write speeds,
giving further credibility to this. (In fact, if anything, timings
were slightly worse when write-allocate was disabled due to an
increased fraction of test runs falling into a secondary cluster with
a longer test runtime.)
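
To illustrate the write pattern described above, here is a minimal C
sketch of a solid fill that issues its bulk writes as 16-byte-aligned
groups of 4 words. It is not the code from the patch (which is ARM
assembly); the function name fill_32bpp is hypothetical, and whether a
C compiler turns each group into a single 4-register store is an
assumption that depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill n 32-bit pixels starting at dst.
 * dst is assumed to be word (4-byte) aligned. */
static void fill_32bpp(uint32_t *dst, uint32_t value, size_t n)
{
    assert(((uintptr_t)dst & 3u) == 0);

    /* Leading pixels: single-word writes until dst is 16-byte aligned. */
    while (n > 0 && ((uintptr_t)dst & 15u) != 0) {
        *dst++ = value;
        n--;
    }

    /* Main body: each iteration writes one 4-word aligned group of
     * 4 words, the size the measurements above found fastest. */
    while (n >= 4) {
        dst[0] = value;
        dst[1] = value;
        dst[2] = value;
        dst[3] = value;
        dst += 4;
        n -= 4;
    }

    /* Trailing pixels: fewer than 4 words remain. */
    while (n > 0) {
        *dst++ = value;
        n--;
    }
}
```

For example, fill_32bpp(row, 0xFF000000, width) would fill one
scanline of a 32bpp image, where row and width are the caller's own
variables.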

I also saw no measurable difference between timings for the VFP
register file compared to the main ARM register file: again, the
optimum size was 4 32-bit registers (or 2 64-bit registers). Although
the use of the VFP would ease register pressure on the ARM register
file, in every case where we're short of registers we also want to do
some integer manipulation of the pixel data, so the VFP is of no
benefit there. It would also limit the
usefulness of this implementation to ARM11s that have VFP fitted, so I
have not pursued this avenue further.

Additional testing of prefetching has identified marked differences in
timings for different address patterns. In particular, there is a 50%
speed penalty if the address is not in the first 2 words of each 8
words: this has been tracked down to a fault in critical-word-first
handling in the BCM2835 L2 cache. An even more extreme effect was
observed if consecutive prefetches referenced the same address - this
doubled the runtime (although I don't know if this is BCM2835 specific
or not). Consequently, I have devised a prefetch scheme that is
careful to prefetch only the addresses of the start of each cache
line, and to only do so once per cache line.
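
The address pattern of that scheme can be sketched in C as follows.
This is only an illustration under stated assumptions: the helper
names are hypothetical, __builtin_prefetch is the GCC/Clang builtin
(typically emitted as PLD on ARM), and the real fast paths interleave
prefetches with pixel processing rather than prefetching a whole range
up front.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u  /* 8 words, the ARM11 L1 line length */

/* Hypothetical helper: list the cache-line start addresses covering
 * the byte range [addr, addr + len). Writes at most max_out entries
 * to out and returns the total number of lines covered. */
static size_t cache_line_starts(uintptr_t addr, size_t len,
                                uintptr_t *out, size_t max_out)
{
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE_BYTES - 1);
    size_t count = 0;

    assert(out != NULL || max_out == 0);
    if (len == 0)
        return 0;

    uintptr_t line = addr & mask;             /* first line start */
    uintptr_t last = (addr + len - 1) & mask; /* last line start  */
    for (;; line += CACHE_LINE_BYTES) {
        if (count < max_out)
            out[count] = line;
        count++;
        if (line == last)
            break;
    }
    return count;
}

/* Issue exactly one prefetch per cache line, always at the first
 * word of the line, so no address is prefetched twice. */
static void prefetch_range(const void *src, size_t len)
{
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE_BYTES - 1);

    if (len == 0)
        return;

    uintptr_t line = (uintptr_t)src & mask;
    uintptr_t last = ((uintptr_t)src + len - 1) & mask;
    for (;; line += CACHE_LINE_BYTES) {
        __builtin_prefetch((const void *)line);
        if (line == last)
            break;
    }
}
```

A 60-byte range starting at address 0x1004 covers the lines at 0x1000
and 0x1020, so it gets two prefetches; one more byte brings in 0x1040.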

I am aware that some may question the targeting of BCM2835 specific
cache behaviours in what is supposed to be a generic ARM11
implementation. However, the cache line size is fixed at 8 words
across ARM1136, ARM1156 and ARM1176, so this approach will not lead to
any cache lines being omitted from prefetch on any ARM11. On an ARM11
without the BCM2835's bugs, the only cost is a branch over a PLD
instruction that would itself have completed in a trivial amount of
time, and that overhead should be minimal. I therefore think it's
valid to propose this patch for all ARMv6 chips.

My new ARMv6 fast paths are assembled using a hierarchy of assembly
macros, in a method inspired by Siarhei's ARM NEON fast paths -
although obviously the details are somewhat different. The majority of
my time so far has been spent on optimising the memory reads and
writes, since these dominate all but the more complex pixel processing
steps. So far, I've only converted a handful of the most common
operations into macro form: for the most part these correspond
to blits and fills, plus the routines which had previously been
included in pixman-arm-simd-asm.S as disassembled versions of C
functions using inline assembler. However, I'm pleased to report that
even in the L1 test where memory overheads are not an issue, these
operations are seeing some improvements from processing more than one
pixel at once, and from the use of the SEL instruction.
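
As an illustration of what "more than one pixel at once" means for an
8bpp operation such as add_8_8, here is a portable C sketch that adds
four a8 pixels held in one 32-bit word, saturating each byte at 0xFF.
It is not taken from the patch: the function names are hypothetical,
and on ARMv6 the same per-byte result is what the UQADD8 instruction
produces in one step (whether the patch uses that instruction is an
assumption here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Saturating add of four unsigned bytes packed in a 32-bit word. */
static uint32_t add_sat_4x8(uint32_t a, uint32_t b)
{
    /* Split into even and odd bytes so each sum has 16 bits of room. */
    uint32_t even = (a & 0x00FF00FFu) + (b & 0x00FF00FFu);
    uint32_t odd  = ((a >> 8) & 0x00FF00FFu) + ((b >> 8) & 0x00FF00FFu);

    /* Bit 8 of each 16-bit lane is set where the byte sum overflowed;
     * multiplying that flag by 0xFF forces those lanes to 0xFF. */
    even = (even | (((even >> 8) & 0x00010001u) * 0xFFu)) & 0x00FF00FFu;
    odd  = (odd  | (((odd  >> 8) & 0x00010001u) * 0xFFu)) & 0x00FF00FFu;

    return even | (odd << 8);
}

/* Hypothetical row routine: dst[i] = saturate(dst[i] + src[i]). */
static void add_8_8_row(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    /* Four pixels per iteration; memcpy keeps the word accesses
     * valid for any alignment. */
    while (n >= 4) {
        uint32_t s, d;
        memcpy(&s, src, 4);
        memcpy(&d, dst, 4);
        d = add_sat_4x8(d, s);
        memcpy(dst, &d, 4);
        src += 4;
        dst += 4;
        n -= 4;
    }

    /* Remaining pixels one at a time. */
    while (n > 0) {
        unsigned sum = (unsigned)*dst + *src++;
        *dst++ = (uint8_t)(sum > 255u ? 255u : sum);
        n--;
    }
}
```

Because every byte lane is handled independently, the result does not
depend on the byte order of the machine.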

One minor change in functionality that I should note is that
previously the top level function pixman_blt() was a no-op on ARMv6,
because neither the armv6 nor the generic C fast path sources filled
in the "blt" field in their pixman_implementation_t structure. I've
fixed this, for what it's worth (though I'm assuming that not much
software can have been using it if it escaped notice before).

I'm distributing the code as it is now to give people a chance to play
with it over the Christmas break. Some of the things I intend to
tackle in the new year are:
* no thought has been put into 24bpp formats yet
* there is no support for scaled plots, either nearest-neighbour or
  bilinear-interpolation as yet; in fact the two scaled blit routines
  are the sole survivors of the old pixman-arm-simd-asm.S at present
* the number of fast path operations is small compared to other
  implementations; I'm targeting eventual parity with the number of
  operations in the NEON case

To give you some idea of the improvements represented by this patch,
please see the numbers below. These represent samples of 100 runs of
lowlevel-blt-bench, and are comparing the head revision from git
against the same with these patches applied. It seems that
lowlevel-blt-bench is not very good at measuring the fastest
operations, as a large proportional random error creeps in - I'm
guessing it's to do with the way it tries to cancel out the function
call overhead.

All the results except those marked with (*) pass a statistical
significance test (Student's independent two-sample t-test). I hope
you'll agree that these are good results; the only fly in the ointment
is the L1 test results for the three blit routines. In these cases,
I'm competing against C fast path implementations that use memcpy() to
do the blit, where memcpy() is already somewhat hand-tuned (although
the tuning it does is obviously much less suited to memory-bound
operations than the one I present here).
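
For readers who want to check individual rows, the following C sketch
shows how an improvement percentage and a two-sample t-test decision
can be derived from the means and standard deviations in the table.
The function names are hypothetical. It assumes equal sample sizes of
100 runs per side, as stated above, in which case the pooled Student's
t statistic reduces to the form in the comment. The critical value is
left to the caller because the significance level used for the table
is not stated. Percentages recomputed this way can differ slightly
from the table, presumably because the table was computed from
unrounded means.

```c
#include <assert.h>

/* Relative change of the new mean over the old mean, in percent. */
static double improvement_percent(double old_mean, double new_mean)
{
    assert(old_mean > 0.0);
    return (new_mean / old_mean - 1.0) * 100.0;
}

/* Two-sample t-test with n samples on each side:
 *
 *     t = (mean_b - mean_a) / sqrt((sd_a^2 + sd_b^2) / n)
 *
 * Returns 1 when |t| exceeds t_crit. Both sides of the comparison
 * are squared so that no square root is needed. */
static int t_test_exceeds(double mean_a, double sd_a,
                          double mean_b, double sd_b,
                          unsigned n, double t_crit)
{
    assert(n > 1);

    double diff = mean_b - mean_a;
    double var_of_diff = (sd_a * sd_a + sd_b * sd_b) / (double)n;

    return diff * diff > t_crit * t_crit * var_of_diff;
}
```

For the src_n_8888 L1 row, improvement_percent(157.1, 590.8) gives
about 276.1. For the src_8_8 L1 row, |t| comes out at roughly 1.9, so
t_test_exceeds(711.9, 172.9, 488.2, 1167.6, 100, 2.0) returns 0.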

            Old             New                 Improvement (%)
        Mean    StdDev  Mean    StdDev

src_n_8888

L1      157.1   11.7    590.8   122.0           276.1
L2      77.7    21.0    336.7   44.9            333.6
M       37.8    0.1     320.5   4.4             747.6
HT      32.1    0.3     75.5    1.9             135.2
VT      29.1    0.3     61.4    1.9             111.4
R       28.1    0.3     53.1    0.9             89.0
RT      14.4    0.5     17.3    0.7             20.2

src_n_0565

L1      153.9   8.2     1372.4  3404.4          791.9
L2      109.4   9.0     680.4   27.4            521.7
M       57.4    0.2     564.4   11.0            883.5
HT      44.4    0.7     84.5    1.8             90.2
VT      38.9    0.5     67.6    3.5             73.7
R       36.9    0.5     59.5    2.7             61.3
RT      16.2    0.6     18.2    1.1             12.8

src_n_8

L1      155.3   12.2    1569.0  2658.0          910.3
L2      107.4   3.7     1098.6  58.3            922.8
M       76.2    0.3     981.7   25.6            1188.9
HT      54.8    1.2     95.5    2.8             74.2
VT      46.6    0.6     74.8    1.7             60.5
R       43.5    0.7     66.8    1.3             53.6
RT      17.4    0.8     19.4    0.7             11.4

src_8888_8888

L1      452.2   385.2   352.1   42.8            -22.1 (*)
L2      58.4    3.8     73.9    4.1             26.6
M       52.3    0.2     71.4    0.3             36.4
HT      25.1    0.2     31.1    0.4             23.6
VT      22.2    0.2     28.5    0.3             28.1
R       17.6    0.2     25.7    0.3             46.4
RT      6.5     0.2     9.4     0.2             44.8

src_0565_0565

L1      404.0   43.7    316.0   46.4            -21.8
L2      79.2    3.4     93.4    2.5             17.9
M       76.3    1.3     115.1   0.8             50.9
HT      33.5    0.4     39.8    0.8             18.6
VT      28.7    0.4     36.0    0.7             25.1
R       21.2    0.2     30.9    0.4             45.6
RT      6.7     0.1     9.6     0.3             43.3

src_8_8

L1      711.9   172.9   488.2   1167.6          -31.4 (*)
L2      140.0   3.9     189.1   10.8            35.1
M       128.3   3.5     203.8   14.3            58.8
HT      38.8    0.6     46.2    2.3             19.0
VT      30.5    1.2     41.0    2.2             34.4
R       24.5    0.3     34.5    1.7             40.7
RT      7.3     0.2     9.8     0.6             33.9

src_x888_8888

L1      95.1    3.4     271.3   22.2            185.3
L2      26.3    1.4     72.3    5.4             174.7
M       20.2    0.1     70.9    1.7             250.8
HT      15.1    0.1     30.8    0.4             103.3
VT      14.4    0.1     28.2    0.3             95.5
R       14.4    0.1     25.6    0.3             77.3
RT      7.5     0.2     9.4     0.3             25.0

src_0565_8888

L1      37.1    1.7     66.9    1.9             80.3
L2      25.9    0.5     53.1    0.6             105.0
M       24.2    0.1     60.9    0.2             152.2
HT      14.2    0.1     28.4    0.3             99.1
VT      14.0    0.1     26.0    0.3             85.5
R       13.0    0.1     23.6    0.2             81.2
RT      5.5     0.1     9.1     0.2             65.3

add_8_8

L1      61.7    3.8     786.3   2557.9          1173.4
L2      38.0    0.5     112.3   3.6             196.0
M       39.1    0.1     106.4   0.8             172.2
HT      29.5    1.2     34.4    0.6             16.6
VT      29.3    0.3     33.0    0.6             12.7
R       20.6    0.2     27.3    0.9             32.8
RT      8.2     0.2     8.7     0.2             5.4

over_8888_8888

L1      32.4    0.8     38.0    0.8             17.3
L2      14.9    0.3     29.9    0.7             100.6
M       12.8    0.0     24.7    0.3             93.3
HT      10.1    0.2     14.0    0.1             37.8
VT      10.0    0.1     13.3    0.1             33.0
R       9.9     0.0     13.9    0.1             39.9
RT      5.8     0.1     6.4     0.2             10.0

over_8888_n_8888

L1      17.8    0.2     21.1    0.4             18.7
L2      11.1    0.2     19.1    0.1             71.2
M       9.9     0.0     19.1    0.0             93.0
HT      8.2     0.0     11.2    0.1             36.5
VT      8.1     0.0     10.6    0.1             31.9
R       8.1     0.0     10.9    0.1             35.4
RT      5.0     0.1     5.5     0.1             10.1

over_n_8_8888

L1      17.7    0.2     23.1    0.5             30.5
L2      13.8    0.4     22.1    0.1             60.4
M       11.8    0.1     22.4    0.0             90.6
HT      10.3    0.3     12.1    0.1             18.1
VT      9.8     0.2     11.5    0.1             17.5
R       9.2     0.0     10.7    0.1             16.6
RT      5.3     0.1     5.8     0.1             7.6

Regards,
Ben Avison
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
