Re: [Pixman] [PATCH] MIPS: DSPr2: Added over_n_8888_8888_ca and over_n_8888_0565_ca fast paths.

2012-03-12 Thread Søren Sandmann
Nemanja Lukic nlu...@mips.com writes:

 [ # ]  backend                         test   min(s) median(s) stddev. count
 [ # ]    image: pixman 0.25.3
 [  0]    image            xfce4-terminal-a1  138.223  139.070   0.33%    6/6
 [ # ]  image16: pixman 0.25.3
 [  0]  image16            xfce4-terminal-a1  132.763  132.939   0.06%    5/6

I'm curious why you chose this particular benchmark? The main paths that
xfce4-terminal-a1 exercises are over_n_1_8888 and add_1_1. As far as I
can tell it doesn't actually hit the two fast paths that you added,
which makes me suspicious about where the speed-up is coming from.


Soren
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH] MIPS: DSPr2: Added over_n_8888_8888_ca and over_n_8888_0565_ca fast paths.

2012-03-12 Thread Lukic, Nemanja
Hi Soren,

I usually select the cairo-perf-trace that uses the optimized fast path the most.
In this case, xfce4-terminal-a1 proved to be that one. I use oprofile to check 
CPU utilization. Here is the oprofile log I got for xfce4-terminal-a1:

CPU: MIPS 74K, speed 0 MHz (estimated)
Counted CYCLES events (Cycles) with a unit mask of 0x00 (No unit mask) count 4
samples  %        image name               app name                 symbol name
2658517  50.3337  no-vmlinux               no-vmlinux               /no-vmlinux
1216517  23.0323  libpixman-1.so           libpixman-1.so           pixman_composite_over_n_8888_8888_ca_asm_mips
270995    5.1308  libc-2.11.2.so           libc-2.11.2.so           memset
165057    3.1250  libm-2.11.2.so           libm-2.11.2.so           floor
139880    2.6483  libpixman-1.so           libpixman-1.so           pixman_fill_buff32_mips_dsp
136303    2.5806  libpixman-1.so           libpixman-1.so           fetch_scanline_a8
61821     1.1705  libc-2.11.2.so           libc-2.11.2.so           memcpy
...

None of the other traces use this fast path as much (this is what my 
oprofile runs on the test system showed).
If you know of a more suitable trace (or a system configuration I need to have, 
such as particular fonts installed), please let me know, and I'll re-run the 
benchmarks and update the commit.

Thanks,
Nemanja Lukic

-Original Message-
From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com] 
Sent: Monday, March 12, 2012 10:05 PM
To: Søren Sandmann
Cc: Lukic, Nemanja; pixman@lists.freedesktop.org; nemanja.lu...@rt-rk.com
Subject: Re: [Pixman] [PATCH] MIPS: DSPr2: Added over_n_8888_8888_ca and 
over_n_8888_0565_ca fast paths.

On Mon, Mar 12, 2012 at 10:48 PM, Søren Sandmann sandm...@cs.au.dk wrote:
 Nemanja Lukic nlu...@mips.com writes:

 [ # ]  backend                         test   min(s) median(s) stddev. count
 [ # ]    image: pixman 0.25.3
 [  0]    image            xfce4-terminal-a1  138.223  139.070   0.33%    6/6
 [ # ]  image16: pixman 0.25.3
 [  0]  image16            xfce4-terminal-a1  132.763  132.939   0.06%    5/6

 I'm curious why you chose this particular benchmark? The main paths that
 xfce4-terminal-a1 exercises are over_n_1_8888 and add_1_1. As far as I
 can tell it doesn't actually hit the two fast paths that you added,
 which makes me suspicious about where the speed-up is coming from.

I think it may actually depend on what fonts are installed in the
system and I vaguely remember encountering this at least once. If
suitable bitmap fonts are missing, then the benchmark might fall back
to some other font and exercise different fast paths.

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH] MIPS: DSPr2: Added over_n_8888_8888_ca and over_n_8888_0565_ca fast paths.

2012-03-12 Thread Siarhei Siamashka
On Mon, Mar 12, 2012 at 11:20 PM, Lukic, Nemanja nlu...@mips.com wrote:
 Hi Soren,

 I usually select the cairo-perf-trace that uses the optimized fast path the most.
 In this case, xfce4-terminal-a1 proved to be that one. I use oprofile to 
 check CPU utilization. Here is the oprofile log I got for xfce4-terminal-a1:

 CPU: MIPS 74K, speed 0 MHz (estimated)
 Counted CYCLES events (Cycles) with a unit mask of 0x00 (No unit mask) count 4
 samples  %        image name               app name                 symbol name
 2658517  50.3337  no-vmlinux               no-vmlinux               /no-vmlinux
 1216517  23.0323  libpixman-1.so           libpixman-1.so           pixman_composite_over_n_8888_8888_ca_asm_mips
 270995    5.1308  libc-2.11.2.so           libc-2.11.2.so           memset
 165057    3.1250  libm-2.11.2.so           libm-2.11.2.so           floor
 139880    2.6483  libpixman-1.so           libpixman-1.so           pixman_fill_buff32_mips_dsp
 136303    2.5806  libpixman-1.so           libpixman-1.so           fetch_scanline_a8
 61821     1.1705  libc-2.11.2.so           libc-2.11.2.so           memcpy
 ...

 None of the other traces use this fast path as much (this is what my 
 oprofile runs on the test system showed).
 If you know of a more suitable trace (or a system configuration I need to have, 
 such as particular fonts installed), please let me know, and I'll re-run the 
 benchmarks and update the commit.

You can try installing the Terminus font
(http://terminus-font.sourceforge.net/) just to check whether this has any
effect on the fast paths used. However, the trace will then no longer be
useful for benchmarking your over_n_8888_8888_ca and over_n_8888_0565_ca
optimizations. Anyway, the purpose of running benchmarks is
to confirm the performance improvement, so I guess this trace is also
fine even though it does not behave as originally intended.

By the way, oprofile logs are also quite informative and may be useful
as part of the commit message. It is also a good idea to
configure oprofile to collect statistics separately per process
instead of producing a flat report for the whole system. This can be done in
the following way:

# opcontrol --deinit
# opcontrol --separate=kernel
# opcontrol --init

Then collect the statistics:

# opcontrol --reset
# opcontrol --start
# ./some-test-binary
# opcontrol --stop

And show it:

# opreport -l ./some-test-binary

When the statistics are collected per process, the idle time currently
attributed to no-vmlinux will disappear, the results should become
perfectly reproducible across multiple runs, and they can also be used to
evaluate the effect of optimizations.

-- 
Best regards,
Siarhei Siamashka


[Pixman] [PATCH] MIPS: DSPr2: Added over_n_8888_8888_ca and over_n_8888_0565_ca fast paths.

2012-03-11 Thread Nemanja Lukic
From: Nemanja Lukic nemanja.lu...@rt-rk.com

Performance numbers before/after on MIPS-74kc @ 1GHz

Reference (before):

lowlevel-blt-bench:
 over_n_8888_8888_ca =  L1:   8.32  L2:   7.65  M:  6.38 ( 51.08%)  HT:  5.78  VT:  5.74  R:  5.84  RT:  4.39 (  37Kops/s)
 over_n_8888_0565_ca =  L1:   7.40  L2:   6.95  M:  6.16 ( 41.06%)  HT:  5.72  VT:  5.52  R:  5.63  RT:  4.28 (  36Kops/s)
cairo-perf-trace:
[ # ]  backend                         test   min(s) median(s) stddev. count
[ # ]    image: pixman 0.25.3
[  0]    image            xfce4-terminal-a1  138.223  139.070   0.33%    6/6
[ # ]  image16: pixman 0.25.3
[  0]  image16            xfce4-terminal-a1  132.763  132.939   0.06%    5/6

Optimized:

lowlevel-blt-bench:
 over_n_8888_8888_ca =  L1:  19.35  L2:  23.84  M: 13.68 (109.39%)  HT: 11.39  VT: 11.19  R: 11.27  RT:  6.90 (  47Kops/s)
 over_n_8888_0565_ca =  L1:  18.68  L2:  17.00  M: 12.56 ( 83.70%)  HT: 10.72  VT: 10.45  R: 10.43  RT:  5.79 (  43Kops/s)
cairo-perf-trace:
[ # ]  backend                         test   min(s) median(s) stddev. count
[ # ]    image: pixman 0.25.3
[  0]    image            xfce4-terminal-a1  130.400  131.720   0.46%    6/6
[ # ]  image16: pixman 0.25.3
[  0]  image16            xfce4-terminal-a1  125.830  126.604   0.34%    6/6
---
 pixman/pixman-mips-dspr2-asm.S |  219 +
 pixman/pixman-mips-dspr2-asm.h |  296 
 pixman/pixman-mips-dspr2.c |   12 ++
 pixman/pixman-mips-dspr2.h |   42 ++
 4 files changed, 569 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-mips-dspr2-asm.S b/pixman/pixman-mips-dspr2-asm.S
index f1087a7..6a0fc18 100644
--- a/pixman/pixman-mips-dspr2-asm.S
+++ b/pixman/pixman-mips-dspr2-asm.S
@@ -308,3 +308,222 @@ LEAF_MIPS_DSPR2(pixman_composite_src_x888_8888_asm_mips)
  nop
 
 END(pixman_composite_src_x888_8888_asm_mips)
+
+LEAF_MIPS_DSPR2(pixman_composite_over_n_8888_8888_ca_asm_mips)
+/*
+ * a0 - dst  (a8r8g8b8)
+ * a1 - src  (32bit constant)
+ * a2 - mask (a8r8g8b8)
+ * a3 - w
+ */
+
+SAVE_REGS_ON_STACK 16, s0, s1, s2, s3, s4, s5, s6, s7
+beqz a3, 4f
+ nop
+li   t6, 0xff
+addiu t7, zero, -1 /* t7 = 0xffffffff */
+srl  t8, a1, 24   /* t8 = srca */
+li   t9, 0x00ff00ff
+addiu t1, a3, -1
+beqz t1, 3f   /* last pixel */
+ nop
+beq  t8, t6, 2f   /* if (srca == 0xff) */
+ nop
+1:
+  /* a1 = src */
+lw   t0, 0(a2)/* t0 = mask */
+lw   t1, 4(a2)/* t1 = mask */
+or   t2, t0, t1
+beqz t2, 12f  /* if (t0 == 0) && (t1 == 0) */
+ addiu   a2, a2, 8
+and  t3, t0, t1
+move s0, t8   /* s0 = srca */
+move s1, t8   /* s1 = srca */
+move t4, a1   /* t4 = src */
+move t5, a1   /* t5 = src */
+lw   t2, 0(a0)/* t2 = dst */
+beq  t3, t7, 11f  /* if (t0 == 0xffffffff) && (t1 == 0xffffffff) */
+ lw  t3, 4(a0)/* t3 = dst */
+MIPS_2xUN8x4_MUL_2xUN8x4 a1, a1, t0, t1, t4, t5, t9, s0, s1, s2, s3, s4, s5
+MIPS_2xUN8x4_MUL_2xUN8   t0, t1, t8, t8, s0, s1, t9, s2, s3, s4, s5, s6, s7
+11:
+not  s0, s0
+not  s1, s1
+MIPS_2xUN8x4_MUL_2xUN8x4 t2, t3, s0, s1, s2, s3, t9, t0, t1, s4, s5, s6, s7
+addu_s.qb t0, t4, s2
+addu_s.qb t1, t5, s3
+sw   t0, 0(a0)
+sw   t1, 4(a0)
+12:
+addiu a3, a3, -2
+addiu t1, a3, -1
+bgtz t1, 1b
+ addiu   a0, a0, 8
+b    3f
+ nop
+2:
+  /* a1 = src */
+lw   t0, 0(a2)/* t0 = mask */
+lw   t1, 4(a2)/* t1 = mask */
+or   t2, t0, t1
+beqz t2, 22f  /* if (t0 == 0) && (t1 == 0) */
+ addiu   a2, a2, 8
+and  t2, t0, t1
+move s0, a1
+beq  t2, t7, 21f  /* if (t0 == 0xffffffff) && (t1 == 0xffffffff) */
+ move s1, a1
+lw   t2, 0(a0)/* t2 = dst */
+lw   t3, 4(a0)/* t3 = dst */
+MIPS_2xUN8x4_MUL_2xUN8x4 a1, a1, t0, t1, t4, t5, t9, s0, s1, s2, s3, s4, s5
+not  t0, t0
+not  t1, t1
+MIPS_2xUN8x4_MUL_2xUN8x4 t2, t3, t0, t1, s0, s1, t9, s2, s3, s4, s5, s6, s7
+addu_s.qb s0, t4, s0
+addu_s.qb s1, t5, s1
+21:
+sw   s0, 0(a0)
+sw   s1, 4(a0)
+22:
+addiu a3, a3, -2
+addiu t1, a3, -1
+bgtz t1, 2b
+ addiu   a0, a0, 8
+3:
+blez a3, 4f
+ nop
+  /* a1 = src */
+lw   t1, 0(a2)/* t1 = mask */
+beqz t1, 4f
+ nop
+move s0, t8   /* s0 = srca */
+move t2, a1   /* t2 = src */
+beq  t1, t7, 31f
+ lw  t0, 0(a0)/* t0 = dst */
+
+