On Thu, 27 Sep 2012 15:18:47 +1200 Andre Renaud <an...@bluewatersys.com> wrote:
> Hi, > I've put together an ARMv5 optimised blitter & fill routine (heavily > based on some code from the Android pixelflinger system). It provides > 16 & 32-bit fills, over 8888 -> 8888 blitting and over 8888 -> 0565 > blitting. I realise that ARMv5 is getting pretty old now, but it's > probably still popular enough that someone else might want this. Hi. Thanks for sharing this information and patches. The benchmark numbers for pixelflinger code are especially interesting. However running "make check" shows that this fails some pixman correctness expectations: PASS: a1-trap-test PASS: pdf-op-test PASS: region-test PASS: region-translate-test PASS: fetch-test rotate test passed (checksum=03A24D51) PASS: rotate-test PASS: oob-test PASS: infinite-loop PASS: trap-crasher PASS: alpha-loop PASS: scaling-crash-test PASS: scaling-helpers-test PASS: gradient-crash-test region_contains test passed (checksum=D2BF8C73) PASS: region-contains-test PASS: alphamap PASS: stress-test composite traps test passed (checksum=E3112106) PASS: composite-traps-test blitters test failed! (checksum=03E77E23, expected 3E1DD2E8) FAIL: blitters-test glyph test passed (checksum=741CB2DB) PASS: glyph-test scaling test failed! (checksum=5BC5CFDC, expected 03A23E0C) FAIL: scaling-test affine test failed! (checksum=12AA0D73, expected 74050F50) FAIL: affine-test ---- Test 4555550 failed ---- Operator: OVER Source: a8r8g8b8, 1x1 Destination: a8r8g8b8, 1x1 R G B A Rounded Source color: 1.000 1.000 1.000 0.000 1.000 1.000 1.000 0.000 Dest. color: 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 Expected: 1.000 1.000 1.000 1.000 Got: 255 255 254 0 [pixel: 0x00fffffe] Min accepted: 254 254 254 254 Max accepted: 256 256 256 256 Test 0x0045831E failed. FAIL: composite ============================================= 4 of 22 tests failed > I've attached the patch here (please excuse the lack of inline patch - > I'll blame that on a combination of my lack of git knowledge, and > gmails inability to do text emails without line wrapping). There is a useful manual about git here: http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#patch-series Use "git format-patch --cover-letter HEAD^^^^" to create the patches. The number of '^' characters above specifies the number of patches in your series you want to submit. Edit the cover letter and then use "git send-email --to=pixman@lists.freedesktop.org *.patch" to send the patches to the list. Github also has some introductory documentation. Nowadays it's difficult to do software development without being familiar with git. It is used by way too many open source projects :) > As the code comes from the Android code base, it is Apache licensed - > is this a problem with Pixman? I am not a lawyer, but Apache license does not seem to be compatible with GPL2 according to: http://www.apache.org/licenses/GPL-compatibility.html This is an extra restriction, which does not exist in the current pixman MIT/X11 license. I personally would not touch Apache licensed code without a really good reason. > For performance results, here are the relevant numbers (tested on an > AT91SAM9G45 - ARM926EJS). The first set of the pair is before the > change, the second is after. > > reference memcpy speed = 230.1MB/s (57.5MP/s for 32bpp fills) > > src_n_0565 = L1: 19.33 L2: 19.75 M: 19.74 ( 17.16%) > HT: 15.77 VT: 15.27 R: 15.25 RT: 7.60 ( 46Kops/s) > src_n_0565 = L1: 195.58 L2: 126.99 M:128.46 (110.53%) > HT: 46.37 VT: 45.73 R: 42.87 RT: 11.26 ( 50Kops/s) > > src_n_x888 = L1: 19.32 L2: 19.75 M: 19.78 ( 34.39%) > HT: 15.79 VT: 15.38 R: 15.31 RT: 7.64 ( 48Kops/s) > src_n_x888 = L1: 111.67 L2: 66.34 M: 65.85 (113.33%) > HT: 36.00 VT: 35.79 R: 33.80 RT: 11.09 ( 50Kops/s) > > src_n_8888 = L1: 19.33 L2: 19.75 M: 19.78 ( 34.39%) > HT: 15.80 VT: 15.39 R: 15.32 RT: 7.64 ( 48Kops/s) > src_n_8888 = L1: 109.40 L2: 66.36 M: 65.89 (113.39%) > HT: 36.03 VT: 35.90 R: 33.81 RT: 11.11 ( 50Kops/s) > > over_8888_0565 = L1: 5.08 L2: 4.49 M: 4.48 ( 11.68%) > HT: 2.80 VT: 2.71 R: 2.69 RT: 2.22 ( 23Kops/s) > over_8888_0565 = L1: 20.96 L2: 14.48 M: 14.63 ( 37.78%) > HT: 8.95 VT: 8.60 R: 8.19 RT: 4.48 ( 37Kops/s) > > over_8888_8888 = L1: 3.21 L2: 3.17 M: 4.22 ( 14.68%) > HT: 6.20 VT: 6.58 R: 6.63 RT: 4.28 ( 36Kops/s) > over_8888_8888 = L1: 21.84 L2: 12.40 M: 12.88 ( 44.32%) > HT: 8.77 VT: 8.85 R: 8.84 RT: 4.68 ( 38Kops/s) > > over_8888_x888 = L1: 9.24 L2: 8.25 M: 8.12 ( 28.22%) > HT: 6.83 VT: 6.77 R: 6.67 RT: 4.29 ( 36Kops/s) > over_8888_x888 = L1: 21.83 L2: 12.39 M: 12.88 ( 44.32%) > HT: 8.76 VT: 8.85 R: 8.84 RT: 4.68 ( 38Kops/s) As for the patch, there are some comments: > +dnl Check for ARMv5 > + > +AC_ARG_ENABLE(arm-v5, > + [AC_HELP_STRING([--disable-arm-v5], > + [disable ARM V5 fast paths])], > + [enable_arm_v5=$enableval], [enable_arm_v5=no]) > + > +if test $enable_arm_v5 = no ; then > + have_arm_v5=disabled > +else > + have_arm_v5=yes > +fi > + > +if test $have_arm_v5 = yes ; then > + AC_DEFINE(USE_ARM_V5, 1, [use ARM v5 assembly optimizations]) > +fi > + > +AM_CONDITIONAL(USE_ARM_V5, test $have_arm_v5 = yes) > + > +AC_MSG_CHECKING(whether to use ARM V5 assembler) > +AC_MSG_RESULT($have_arm_v5) Looks like arm-v5 support is disabled by default unless overrided by configure option. This is not consistent with the rest of the cpu specific optimizations (arm-simd, arm-neon). Normally everything is compiled in, but only gets used when the needed cpu features are detected at runtime. > +pixman_arm_blitrow_32: > + > + push {r4-r11} > +/* > + * r0 - dst > + * r1 - src > + * r2 - count > + */ > +.Lresidual_loop: > + mov r10, #0xFF > + orr r10, r10, r10, lsl #16 //mask = r10 = 0x00FF00FF > + > + subs r2, r2, #2 > + blt .Lblitrow32_single_loop > + > +.Lblitrow32_double_loop: > + ldm r0, {r3, r4} > + ldm r1!, {r5, r6} > + > + orrs r9, r3, r4 > + beq .Lblitrow32_loop_cond > + > + // First iteration > + lsr r7, r5, #24 //extract alpha > + and r8, r3, r10 //rb = (dst & mask) > + rsb r7, r7, #256 //r5 = scale = (255-alpha)+1 > + and r9, r10, r3, lsr #8 //ag = (dst>>8) & mask > + > + mul r11, r8, r7 //RB = rb * scale > + mul r3, r9, r7 //AG = ag * scale > + > + // combine RB and AG > + and r11, r10, r11, lsr #8 //r8 = (RB>>8) & mask > + and r3, r3, r10, lsl #8 //r9 = AG & ~mask > + > + lsr r7, r6, #24 //extract alpha for second > iteration > + orr r3, r3, r11 > + > + // Second iteration > + and r8, r4, r10 //rb = (dst & mask) > + rsb r7, r7, #256 //r5 = scale = (255-alpha)+1 > + and r9, r10, r4, lsr #8 //ag = (dst>>8) & mask > + > + mul r11, r8, r7 //RB = rb * scale > + mul r4, r9, r7 //AG = ag * scale > + > + // combine RB and AG > + and r11, r10, r11, lsr #8 //r8 = (RB>>8) & mask > + and r4, r4, r10, lsl #8 //r9 = AG & ~mask > + orr r4, r4, r11 > + > + // add src to combined value > + add r5, r5, r3 > + add r6, r6, r4 > + > +.Lblitrow32_loop_cond: > + subs r2, r2, #2 > + stm r0!, {r5, r6} > + > + bge .Lblitrow32_double_loop > + > +.Lblitrow32_single_loop: > + adds r2, #1 > + blo .Lexit > + > + ldr r3, [r0] > + ldr r5, [r1], #4 > + > + cmp r3, #0 > + beq .Lblitrow32_single_store > + > + lsr r7, r5, #24 //extract alpha > + and r8, r3, r10 //rb = (dst & mask) > + rsb r7, r7, #256 //r5 = scale = (255-alpha)+1 > + and r9, r10, r3, lsr #8 //ag = (dst>>8) & mask > + > + mul r8, r8, r7 //RB = rb * scale > + mul r9, r9, r7 //AG = ag * scale > + > + // combine RB and AG > + and r8, r10, r8, lsr #8 //r8 = (RB>>8) & mask > + and r9, r9, r10, lsl #8 //r9 = AG & ~mask > + orr r3, r8, r9 This code does (x * (255 - a + 1)) / 256 instead of (x * (255 - a) + 127) / 255 It's not necessarily bad, just trades accuracy for performance. At least the corner opaque/transparent cases are handled normally. > + add r5, r5, r3 //add src to combined value And here pixman normally does saturated addition. The differences show up when at least one of the R/G/B color components of the source pixel is bigger than alpha. This is described in "Additive and blended alpha in a single operation" part of http://home.comcast.net/~tom_forsyth/blog.wiki.html#[[Premultiplied%20alpha]] The older discussion in the pixman mailing list was here: http://comments.gmane.org/gmane.comp.graphics.pixman/454 All modern processors have no problems with saturated addition (that's handled by UQADD8 instruction starting from ARMv6). But emulating saturated addition becomes a serious performance problem for ARMv5, as it is done with the following code: http://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine.h.template?id=pixman-0.27.2#n57 http://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine.h.template?id=pixman-0.27.2#n209 When having both source and destination pixels split into RB (00 RR 00 BB) and AG (00 AA 00 GG) parts, performing the saturated addition for these parts and combining the results needs 11 ARM instructions in the best case if the compiler does its job correctly (and often it does not). The code from pixelflinger and pixman ARMv6 code only uses 2 instructions for this (not to mention that splitting the source pixel into AG and RB parts is not needed). Saturated addition emulation is very slow. However verifying whether the saturated addition is needed can be done a bit faster: http://permalink.gmane.org/gmane.comp.graphics.pixman/461 Or in ARM assembly (should work, but I haven't tested it): orr TMP, SRC, #0xFF0000 cmp SRC, SRC, lsl #8 cmphs TMP, SRC, lsl #16 cmphs SRC, SRC, lsl #24 blo <we_actually_need_saturated_addition> That's 5 instructions, which is less overhead than actually performing saturated addition every time. And if we have a compositing operation with solid source, this check can be done just once per function outside the loop. Everything depends on whether we want to keep the current pixman behaviour also on ARMv5 and just minimize the performance hit. Or cut the corners in the same way as pixelflinger. > +.Lblitrow32_single_store: > + str r5, [r0], #4 > + > +.Lexit: > + pop {r4-r11} > + bx lr > #if defined(USE_ARM_SIMD) || defined(USE_ARM_NEON) || defined(USE_ARM_IWMMXT) This also needs "|| defined(USE_ARM_V5)" or the code will fail to compile if you disable arm-simd, arm-neon and arm-iwmmxt, but enable just arm-v5. > @@ -168,6 +169,8 @@ detect_cpu_features (void) > features |= (ARM_V7 | ARM_V6); > else if (strncmp (plat, "v6l", 3) == 0) > features |= ARM_V6; >+ else if (strncmp (plat, "v5l", 3) == 0) >+ features |= ARM_V5; > } > } The ARMv6 and ARMv7 processors also can execute ARMv5 enhanced dsp instructions, but you are setting ARM_V5 bit only for "v5l". And to be more correct, this should be a check not for ARMv5, but for "edsp" feature. Also the assembly file should have ".func" / ".endfunc" directives around functions (otherwise thumb interworking can be broken) and mark the stack as non-executable (potential security issue). BTW, the nearest scaling code from pixman-arm-simd-asm.S can be moved to the arm-v5 file if pixman gets ARMv5 support. -- Best regards, Siarhei Siamashka _______________________________________________ Pixman mailing list Pixman@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/pixman