Re: [Pixman] [PATCH] MIPS: DSPr2: Added over_n_8888_8888_ca and over_n_8888_0565_ca fast paths.
Nemanja Lukic nlu...@mips.com writes:

> [ # ]  backend  test               min(s)   median(s)  stddev.  count
> [ # ]  image: pixman 0.25.3
> [ 0]   image    xfce4-terminal-a1  138.223  139.070    0.33%    6/6
> [ # ]  image16: pixman 0.25.3
> [ 0]   image16  xfce4-terminal-a1  132.763  132.939    0.06%    5/6

I'm curious why you chose this particular benchmark? The main paths that
xfce4-terminal-a1 exercises are over_n_1_8888 and add_1_1. As far as I can
tell it doesn't actually hit the two fast paths that you added, which makes
it suspicious where the speed-up is coming from.

Soren
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
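To put a number on the speed-up being questioned here, the medians quoted above can be compared with the "Optimized" medians from the original patch message further down (131.720 s for image, 126.604 s for image16). The following is only a sketch of that arithmetic; the function names are made up for illustration and are not part of cairo-perf-trace or pixman.

```c
#include <stdio.h>

/* Relative reduction in run time, in percent (positive means faster).
 * Hypothetical helper, not part of any benchmark tool. */
double time_reduction_percent(double before_s, double after_s)
{
    return (before_s - after_s) / before_s * 100.0;
}

/* Print one summary line for a trace; `label` is used for display only. */
void report_trace(const char *label, double before_s, double after_s)
{
    printf("%-8s %8.3f s -> %8.3f s  (%.2f%% less time)\n",
           label, before_s, after_s,
           time_reduction_percent(before_s, after_s));
}
```

Calling `report_trace("image", 139.070, 131.720)` and `report_trace("image16", 132.939, 126.604)` gives reductions of roughly 5.3% and 4.8%. That is small next to the roughly 2x gains the patch reports for lowlevel-blt-bench, which fits a trace that spends only part of its time in the new fast paths.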
Re: [Pixman] [PATCH] MIPS: DSPr2: Added over_n_8888_8888_ca and over_n_8888_0565_ca fast paths.
Hi Soren,

I usually select the cairo-perf-trace trace that utilizes the optimized fast
path the most. In this case, xfce4-terminal-a1 proved to be that one. I use
oprofile to check CPU utilization. Here is the oprofile log I got for
xfce4-terminal-a1:

CPU: MIPS 74K, speed 0 MHz (estimated)
Counted CYCLES events (Cycles) with a unit mask of 0x00 (No unit mask) count 4
samples  %        image name      app name        symbol name
2658517  50.3337  no-vmlinux      no-vmlinux      /no-vmlinux
1216517  23.0323  libpixman-1.so  libpixman-1.so  pixman_composite_over_n_8888_8888_ca_asm_mips
270995    5.1308  libc-2.11.2.so  libc-2.11.2.so  memset
165057    3.1250  libm-2.11.2.so  libm-2.11.2.so  floor
139880    2.6483  libpixman-1.so  libpixman-1.so  pixman_fill_buff32_mips_dsp
136303    2.5806  libpixman-1.so  libpixman-1.so  fetch_scanline_a8
61821     1.1705  libc-2.11.2.so  libc-2.11.2.so  memcpy
...

None of the other traces utilize this fast path as much (according to my
oprofile runs on the test system). If you know of a more suitable trace (or
a system configuration I need to have, such as particular fonts installed),
please let me know, and I'll re-run the benchmarks and update the commit.

Thanks,
Nemanja Lukic

-----Original Message-----
From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com]
Sent: Monday, March 12, 2012 10:05 PM
To: Søren Sandmann
Cc: Lukic, Nemanja; pixman@lists.freedesktop.org; nemanja.lu...@rt-rk.com
Subject: Re: [Pixman] [PATCH] MIPS: DSPr2: Added over_n_8888_8888_ca and over_n_8888_0565_ca fast paths.

On Mon, Mar 12, 2012 at 10:48 PM, Søren Sandmann sandm...@cs.au.dk wrote:
> Nemanja Lukic nlu...@mips.com writes:
>
>> [ # ]  backend  test               min(s)   median(s)  stddev.  count
>> [ # ]  image: pixman 0.25.3
>> [ 0]   image    xfce4-terminal-a1  138.223  139.070    0.33%    6/6
>> [ # ]  image16: pixman 0.25.3
>> [ 0]   image16  xfce4-terminal-a1  132.763  132.939    0.06%    5/6
>
> I'm curious why you chose this particular benchmark? The main paths that
> xfce4-terminal-a1 exercises are over_n_1_8888 and add_1_1. As far as I can
> tell it doesn't actually hit the two fast paths that you added, which makes
> it suspicious where the speed-up is coming from.

I think it may actually depend on what fonts are installed in the system,
and I vaguely remember encountering this at least once. If the suitable
bitmap fonts are missing, then the benchmark might fall back to some other
font and exercise different fast paths.

-- 
Best regards,
Siarhei Siamashka
Re: [Pixman] [PATCH] MIPS: DSPr2: Added over_n_8888_8888_ca and over_n_8888_0565_ca fast paths.
On Mon, Mar 12, 2012 at 11:20 PM, Lukic, Nemanja nlu...@mips.com wrote:
> Hi Soren,
>
> I usually select the cairo-perf-trace trace that utilizes the optimized fast
> path the most. In this case, xfce4-terminal-a1 proved to be that one. I use
> oprofile to check CPU utilization. Here is the oprofile log I got for
> xfce4-terminal-a1:
>
> CPU: MIPS 74K, speed 0 MHz (estimated)
> Counted CYCLES events (Cycles) with a unit mask of 0x00 (No unit mask) count 4
> samples  %        image name      app name        symbol name
> 2658517  50.3337  no-vmlinux      no-vmlinux      /no-vmlinux
> 1216517  23.0323  libpixman-1.so  libpixman-1.so  pixman_composite_over_n_8888_8888_ca_asm_mips
> 270995    5.1308  libc-2.11.2.so  libc-2.11.2.so  memset
> 165057    3.1250  libm-2.11.2.so  libm-2.11.2.so  floor
> 139880    2.6483  libpixman-1.so  libpixman-1.so  pixman_fill_buff32_mips_dsp
> 136303    2.5806  libpixman-1.so  libpixman-1.so  fetch_scanline_a8
> 61821     1.1705  libc-2.11.2.so  libc-2.11.2.so  memcpy
> ...
>
> None of the other traces utilize this fast path as much (according to my
> oprofile runs on the test system). If you know of a more suitable trace (or
> a system configuration I need to have, such as particular fonts installed),
> please let me know, and I'll re-run the benchmarks and update the commit.

You can try to install the Terminus font
(http://terminus-font.sourceforge.net/) just to check whether this has any
effect on the fast paths used. However, the trace will then no longer be
useful for benchmarking your over_n_8888_8888_ca and over_n_8888_0565_ca
optimizations. Anyway, the purpose of running benchmarks is to confirm the
performance improvement, so I guess this trace is also fine even though it
does not behave as originally intended.

By the way, oprofile logs are also quite informative and may be useful as
part of the commit message.

Also, it is a good idea to configure oprofile to collect statistics
separately per process instead of producing a flat report for the whole
system. This can be done in the following way:

    # opcontrol --deinit
    # opcontrol --separate=kernel
    # opcontrol --init

Then collect the statistics:

    # opcontrol --reset
    # opcontrol --start
    # ./some-test-binary
    # opcontrol --stop

And show them:

    # opreport -l ./some-test-binary

When the statistics are collected per process, the idle time currently
attributed to no-vmlinux will disappear, the results should become perfectly
reproducible across multiple runs, and they can also be used to evaluate the
effect of optimizations.

-- 
Best regards,
Siarhei Siamashka
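For readers following the assembly in the patch below, here is a rough per-pixel C sketch of what the general (non-shortcut) path of over_n_8888_8888_ca appears to compute: each channel of the solid source is multiplied by the matching mask channel, the destination is scaled by the inverse of mask times source alpha, and the two are added with saturation. The function and helper names are hypothetical and are not pixman API; the rounding helper assumes the usual "divide by 255 with rounding" idiom, and the source colour is assumed to be premultiplied.

```c
#include <stdint.h>

/* Rounded (a * b) / 255 for 8-bit values (assumed rounding idiom). */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)(((t >> 8) + t) >> 8);
}

/* Saturating 8-bit add, the per-byte behaviour of addu_s.qb. */
static uint8_t add_sat_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a + b;
    return (uint8_t)(t > 0xff ? 0xff : t);
}

/* Hypothetical reference for one a8r8g8b8 pixel:
 * dst = src * mask + dst * ~(mask * srca), channel by channel. */
uint32_t over_n_8888_8888_ca_pixel(uint32_t src, uint32_t mask, uint32_t dst)
{
    uint8_t srca = (uint8_t)(src >> 24);
    uint32_t out = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8) {
        uint8_t s = (uint8_t)(src >> shift);
        uint8_t m = (uint8_t)(mask >> shift);
        uint8_t d = (uint8_t)(dst >> shift);
        uint8_t sm = mul_un8(s, m);                 /* src * mask */
        uint8_t inv = (uint8_t)~mul_un8(m, srca);   /* 255 - mask * srca */
        out |= (uint32_t)add_sat_un8(sm, mul_un8(d, inv)) << shift;
    }
    return out;
}
```

The two shortcuts visible in the assembly fall out of this formula: a mask of 0 leaves the destination unchanged, and a mask of 0xffffffff with an opaque source (srca == 0xff) simply stores the source. The assembly processes two pixels per loop iteration, which this sketch does not attempt to model.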
[Pixman] [PATCH] MIPS: DSPr2: Added over_n_8888_8888_ca and over_n_8888_0565_ca fast paths.
From: Nemanja Lukic nemanja.lu...@rt-rk.com

Performance numbers before/after on MIPS-74kc @ 1GHz

Reference (before):

lowlevel-blt-bench:
    over_n_8888_8888_ca =  L1:   8.32  L2:   7.65  M:  6.38 ( 51.08%)  HT:  5.78  VT:  5.74  R:  5.84  RT:  4.39 (  37Kops/s)
    over_n_8888_0565_ca =  L1:   7.40  L2:   6.95  M:  6.16 ( 41.06%)  HT:  5.72  VT:  5.52  R:  5.63  RT:  4.28 (  36Kops/s)

cairo-perf-trace:
[ # ]  backend  test               min(s)   median(s)  stddev.  count
[ # ]  image: pixman 0.25.3
[ 0]   image    xfce4-terminal-a1  138.223  139.070    0.33%    6/6
[ # ]  image16: pixman 0.25.3
[ 0]   image16  xfce4-terminal-a1  132.763  132.939    0.06%    5/6

Optimized:

lowlevel-blt-bench:
    over_n_8888_8888_ca =  L1:  19.35  L2:  23.84  M: 13.68 (109.39%)  HT: 11.39  VT: 11.19  R: 11.27  RT:  6.90 (  47Kops/s)
    over_n_8888_0565_ca =  L1:  18.68  L2:  17.00  M: 12.56 ( 83.70%)  HT: 10.72  VT: 10.45  R: 10.43  RT:  5.79 (  43Kops/s)

cairo-perf-trace:
[ # ]  backend  test               min(s)   median(s)  stddev.  count
[ # ]  image: pixman 0.25.3
[ 0]   image    xfce4-terminal-a1  130.400  131.720    0.46%    6/6
[ # ]  image16: pixman 0.25.3
[ 0]   image16  xfce4-terminal-a1  125.830  126.604    0.34%    6/6
---
 pixman/pixman-mips-dspr2-asm.S |  219 +
 pixman/pixman-mips-dspr2-asm.h |  296
 pixman/pixman-mips-dspr2.c     |   12 ++
 pixman/pixman-mips-dspr2.h     |   42 ++
 4 files changed, 569 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-mips-dspr2-asm.S b/pixman/pixman-mips-dspr2-asm.S
index f1087a7..6a0fc18 100644
--- a/pixman/pixman-mips-dspr2-asm.S
+++ b/pixman/pixman-mips-dspr2-asm.S
@@ -308,3 +308,222 @@ LEAF_MIPS_DSPR2(pixman_composite_src_x888_8888_asm_mips)
      nop
 
 END(pixman_composite_src_x888_8888_asm_mips)
+
+LEAF_MIPS_DSPR2(pixman_composite_over_n_8888_8888_ca_asm_mips)
+/*
+ * a0 - dst  (a8r8g8b8)
+ * a1 - src  (32bit constant)
+ * a2 - mask (a8r8g8b8)
+ * a3 - w
+ */
+
+    SAVE_REGS_ON_STACK 16, s0, s1, s2, s3, s4, s5, s6, s7
+    beqz         a3, 4f
+     nop
+    li           t6, 0xff
+    addiu        t7, zero, -1 /* t7 = 0xffffffff */
+    srl          t8, a1, 24   /* t8 = srca */
+    li           t9, 0x00ff00ff
+    addiu        t1, a3, -1
+    beqz         t1, 3f       /* last pixel */
+     nop
+    beq          t8, t6, 2f   /* if (srca == 0xff) */
+     nop
+1:
+                              /* a1 = src */
+    lw           t0, 0(a2)    /* t0 = mask */
+    lw           t1, 4(a2)    /* t1 = mask */
+    or           t2, t0, t1
+    beqz         t2, 12f      /* if (t0 == 0) && (t1 == 0) */
+     addiu       a2, a2, 8
+    and          t3, t0, t1
+    move         s0, t8       /* s0 = srca */
+    move         s1, t8       /* s1 = srca */
+    move         t4, a1       /* t4 = src */
+    move         t5, a1       /* t5 = src */
+    lw           t2, 0(a0)    /* t2 = dst */
+    beq          t3, t7, 11f  /* if (t0 == 0xffffffff) && (t1 == 0xffffffff) */
+     lw          t3, 4(a0)    /* t3 = dst */
+    MIPS_2xUN8x4_MUL_2xUN8x4 a1, a1, t0, t1, t4, t5, t9, s0, s1, s2, s3, s4, s5
+    MIPS_2xUN8x4_MUL_2xUN8   t0, t1, t8, t8, s0, s1, t9, s2, s3, s4, s5, s6, s7
+11:
+    not          s0, s0
+    not          s1, s1
+    MIPS_2xUN8x4_MUL_2xUN8x4 t2, t3, s0, s1, s2, s3, t9, t0, t1, s4, s5, s6, s7
+    addu_s.qb    t0, t4, s2
+    addu_s.qb    t1, t5, s3
+    sw           t0, 0(a0)
+    sw           t1, 4(a0)
+12:
+    addiu        a3, a3, -2
+    addiu        t1, a3, -1
+    bgtz         t1, 1b
+     addiu       a0, a0, 8
+    b            3f
+     nop
+2:
+                              /* a1 = src */
+    lw           t0, 0(a2)    /* t0 = mask */
+    lw           t1, 4(a2)    /* t1 = mask */
+    or           t2, t0, t1
+    beqz         t2, 22f      /* if (t0 == 0) && (t1 == 0) */
+     addiu       a2, a2, 8
+    and          t2, t0, t1
+    move         s0, a1
+    beq          t2, t7, 21f  /* if (t0 == 0xffffffff) && (t1 == 0xffffffff) */
+     move        s1, a1
+    lw           t2, 0(a0)    /* t2 = dst */
+    lw           t3, 4(a0)    /* t3 = dst */
+    MIPS_2xUN8x4_MUL_2xUN8x4 a1, a1, t0, t1, t4, t5, t9, s0, s1, s2, s3, s4, s5
+    not          t0, t0
+    not          t1, t1
+    MIPS_2xUN8x4_MUL_2xUN8x4 t2, t3, t0, t1, s0, s1, t9, s2, s3, s4, s5, s6, s7
+    addu_s.qb    s0, t4, s0
+    addu_s.qb    s1, t5, s1
+21:
+    sw           s0, 0(a0)
+    sw           s1, 4(a0)
+22:
+    addiu        a3, a3, -2
+    addiu        t1, a3, -1
+    bgtz         t1, 2b
+     addiu       a0, a0, 8
+3:
+    blez         a3, 4f
+     nop
+                              /* a1 = src */
+    lw           t1, 0(a2)    /* t1 = mask */
+    beqz         t1, 4f
+     nop
+    move         s0, t8       /* s0 = srca */
+    move         t2, a1       /* t2 = src */
+    beq          t1, t7, 31f
+     lw          t0, 0(a0)    /* t0 = dst */
+
+