Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

2012-05-24 Thread Siarhei Siamashka
On Wed, May 23, 2012 at 9:13 PM, Søren Sandmann sandm...@cs.au.dk wrote:
 Lukic, Nemanja nlu...@mips.com writes:

 I added more explanation in the commit message for that commit.

 Thanks, I have pushed it to master (with some minor formatting changes)
 along with the bilinear optimization patch.

That's good news. To be on the safe side, I have also verified that
the pixman tests pass with the git master
30816e3068bccf7c78c78f916b54971d24873bdc

Looks like the MIPS DSPr2 part is ready for the stable pixman release.

It would also be interesting (but of course not strictly necessary) to
get the MIPS DSPr2 performance improvement numbers summarised in one
table, similar to the one presented in:

http://mattst88.com/blog/2012/05/17/Optimizing_pixman_for_Loongson:_Process_and_Results/

I can try to run the full set of cairo-perf-trace benchmarks, but my
gigabit router with MIPS 74K processor only has 128MiB of RAM and some
of the traces may run out of memory. Also it may take an eternity to
run till completion :)

-- 
Best regards,
Siarhei Siamashka
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman


Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

2012-05-23 Thread Lukic, Nemanja
Hi Soren,

I added more explanation in the commit message for that commit.
Sorry for the late reply. If you think that anything is missing in the commit 
message, please tell me, and I'll update it.

Thanks,
Nemanja Lukic

-Original Message-
From: pixman-bounces+nlukic=mips@lists.freedesktop.org 
[mailto:pixman-bounces+nlukic=mips@lists.freedesktop.org] On Behalf Of 
Søren Sandmann
Sent: Wednesday, May 23, 2012 6:51 AM
To: Siarhei Siamashka
Cc: pixman@lists.freedesktop.org; nemanja.lu...@rt-rk.com
Subject: Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888
fast path.

Siarhei Siamashka siarhei.siamas...@gmail.com writes:

 On Wed, May 16, 2012 at 12:37 AM, Siarhei Siamashka
 siarhei.siamas...@gmail.com wrote:
 On Mon, May 14, 2012 at 9:17 PM, Nemanja Lukic nemanja.lu...@rt-rk.com 
 wrote:
 Is this small improvement worth making this code vulnerable to endian 
 issues?

 If you are already satisfied with this level of performance, then it's
 probably fine for now.

 By the way, I really mean it :) In my opinion, it is generally enough
 that the patches are useful for something and do not cause
 regressions. If implementing additional performance tweaks may take
 too much time, then they can be added later. But it is also important
 to realize that there is still some room for improvement and not to
 drop the optimization work half-way.

 Also maybe you have noticed that pixman-0.26.0 is about to be released
 next week:
 http://lists.freedesktop.org/archives/pixman/2012-May/001969.html
 We still need to either fix the bug which causes the test suite
 failure for MIPS DSP ASE, or at least disable the problematic
 optimizations for this stable release.

Yeah, unless someone who understands the fix here:

   http://lists.freedesktop.org/archives/pixman/2012-May/001932.html

comes up with a commit message, I'll just revert the optimization before
releasing.


Søren


Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

2012-05-23 Thread Lukic, Nemanja
Hi Siarhei,

Your comments on the bilinear commit are truly worth a lot more investigation,
and you made a strong point that these optimizations should be revisited.
But, for now, these tweaks would need a lot of time to implement, and may also
affect more than just the bilinear commit.
Since I do plan to push more fast paths (both bilinear, and SRC/OVER/ADD) for
MIPS DSP, I think we should use this bilinear patch as it is for now (for this
pixman release), since it shows a good performance increase and doesn't show
any regressions. I will certainly come back with a new commit that includes
the tweaks you suggested and improves the existing commit(s).

Thanks,
Nemanja Lukic

-Original Message-
From: pixman-bounces+nlukic=mips@lists.freedesktop.org 
[mailto:pixman-bounces+nlukic=mips@lists.freedesktop.org] On Behalf Of 
Siarhei Siamashka
Sent: Sunday, May 20, 2012 11:31 PM
To: nemanja.lu...@rt-rk.com
Cc: pixman@lists.freedesktop.org
Subject: Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888
fast path.

On Wed, May 16, 2012 at 12:37 AM, Siarhei Siamashka
siarhei.siamas...@gmail.com wrote:
 On Mon, May 14, 2012 at 9:17 PM, Nemanja Lukic nemanja.lu...@rt-rk.com 
 wrote:
 Is this small improvement worth making this code vulnerable to endian issues?

 If you are already satisfied with this level of performance, then it's
 probably fine for now.

By the way, I really mean it :) In my opinion, it is generally enough
that the patches are useful for something and do not cause
regressions. If implementing additional performance tweaks may take
too much time, then they can be added later. But it is also important
to realize that there is still some room for improvement and not to
drop the optimization work half-way.

Also maybe you have noticed that pixman-0.26.0 is about to be released
next week:
http://lists.freedesktop.org/archives/pixman/2012-May/001969.html
We still need to either fix the bug which causes the test suite
failure for MIPS DSP ASE, or at least disable the problematic
optimizations for this stable release.

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

2012-05-23 Thread Søren Sandmann
Lukic, Nemanja nlu...@mips.com writes:

 I added more explanation in the commit message for that commit.

Thanks, I have pushed it to master (with some minor formatting changes)
along with the bilinear optimization patch.


Thanks,
Søren


Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

2012-05-20 Thread Siarhei Siamashka
On Wed, May 16, 2012 at 5:27 AM, Siarhei Siamashka
siarhei.siamas...@gmail.com wrote:
 +/*
 + * Multiply pixel (a8) with single pixel (a8r8g8b8). It requires maskLSR
 + * needed for rounding process. maskLSR must have following value:
 + *   li       maskLSR, 0x00ff00ff
 + */
 +.macro MIPS_UN8x4_MUL_UN8 s_8888,  \
 +                          m_8,     \
 +                          d_8888,  \
 +                          maskLSR, \
 +                          scratch1, scratch2, scratch3
 +    replv.ph          \m_8,      \m_8                 /*   0 | M | 0 | M */
 +    muleu_s.ph.qbl    \scratch1, \s_8888,   \m_8      /*    A*M  |  R*M */
 +    muleu_s.ph.qbr    \scratch2, \s_8888,   \m_8      /*    G*M  |  B*M */
 +    shra_r.ph         \scratch3, \scratch1, 8
 +    shra_r.ph         \d_8888,   \scratch2, 8
 +    and               \scratch3, \scratch3, \maskLSR  /*   0 |A*M| 0 |R*M */
 +    and               \d_8888,   \d_8888,   \maskLSR  /*   0 |G*M| 0 |B*M */
 +    addq.ph           \scratch1, \scratch1, \scratch3 /* A*M+A*M | R*M+R*M */
 +    addq.ph           \scratch2, \scratch2, \d_8888   /* G*M+G*M | B*M+B*M */
 +    shra_r.ph         \scratch1, \scratch1, 8
 +    shra_r.ph         \scratch2, \scratch2, 8
 +    precr.qb.ph       \d_8888,   \scratch1, \scratch2
 +.endm

 A possible alternative way is to just use a single MULQ_RS.W
 instruction for each color component. That's total 5 instructions
 because 8-bit alpha value from mask needs to be premultiplied by
 8421504. A test program is listed below:

 /***/

 #include <stdio.h>
 #include <stdint.h>

 /* Reference: 8-bit x 8-bit multiply with rounded division by 255 */
 int mul_un8(int a, int b)
 {
 #if 1
    int t = a * b + 0x80;
    return (t + (t >> 8)) >> 8;
 #else
    return (a * b + 127) / 255;
 #endif
 }

 /* Same result via one rounded Q31 multiply; 8421504 ~= 2^31 / 255 */
 int mul_un8_mips(int a, int b)
 {
    int c;
    b *= 8421504;
 #if 1
    asm ("mulq_rs.w %0, %1, %2" : "=r" (c) : "r" (a), "r" (b));
 #else
    c = ((int64_t)a * b + (1 << 30)) >> 31;
 #endif
    return c;
 }

 int main()
 {
    int a, b;
    for (a = 0; a < 256; a++)
    {
        for (b = 0; b < 256; b++)
        {
            if (mul_un8(a, b) != mul_un8_mips(a, b))
            {
                printf("test failed! a=%d b=%d\n", a, b);
                return 1;
            }
        }
    }
    printf("test passed\n");
    return 0;
 }

 /***/

 There is only one problem with MULQ_RS.W instruction: it seems to have
 a huge ~14 (!) cycles latency. But the throughput is ok (1 cycle per
 instruction). So in order to hide this latency, some serious loop
 unrolling may be needed (possibly up to handling 4 pixels at once).
 The other MIPS DSP ASE fast path functions may also try to use
 MULQ_RS.W

Maybe a bit more explanation would be useful. When multiplying an 8-bit
color component by an 8-bit alpha for the OVER operator, pixman actually
wants to do:

x' = (x * a + (255 / 2)) / 255;

This is division by 255 with the result rounded to the nearest integer.
Because integer division is slow, the C implementation replaces it with
shifts and additions:

http://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine.h.template?id=pixman-0.24.4#n27

t  = x * a + 0x80;
x' = (t + (t >> 8)) >> 8;

This method of calculation is also good because the intermediate
results fit in unsigned 16-bit variables, which also allows a
SIMD-like trick to process two color components at once on 32-bit
systems:

http://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine.h.template?id=pixman-0.24.4#n40

/* xy = 0x00AB00CD, where AB is one color component, CD - another  */
t   = xy * a + 0x00800080;
xy' = ((t + ((t >> 8) & 0x00FF00FF)) >> 8) & 0x00FF00FF;
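
As a rough illustration of the two formulas above, here is a minimal C sketch that compares the packed two-component form against the plain rounded division by 255. The function names and the `count_packed_mismatches` helper are made up for this example and are not pixman identifiers.

```c
#include <stdint.h>

/* Reference: one 8-bit component times 8-bit alpha, rounded division by 255. */
static uint8_t mul_un8_ref(uint8_t x, uint8_t a)
{
    return (uint8_t)((x * a + 127) / 255);
}

/* Shift-and-add form for a single component. */
static uint8_t mul_un8(uint8_t x, uint8_t a)
{
    uint32_t t = (uint32_t)x * a + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Two components packed as 0x00XX00YY, both multiplied by `a` at once.
 * Each 16-bit lane stays below 65536, so nothing carries between lanes. */
static uint32_t mul_un8x2(uint32_t xy, uint8_t a)
{
    uint32_t t = xy * a + 0x00800080;
    return ((t + ((t >> 8) & 0x00FF00FF)) >> 8) & 0x00FF00FF;
}

/* Hypothetical helper: count inputs where either form disagrees
 * with the reference (x and a exhaustive, y derived from x). */
static int count_packed_mismatches(void)
{
    int bad = 0;
    for (int x = 0; x < 256; x++) {
        for (int a = 0; a < 256; a++) {
            int y = 255 - x;
            uint32_t packed = ((uint32_t)x << 16) | (uint32_t)y;
            uint32_t expect = ((uint32_t)mul_un8_ref((uint8_t)x, (uint8_t)a) << 16)
                            | mul_un8_ref((uint8_t)y, (uint8_t)a);
            if (mul_un8x2(packed, (uint8_t)a) != expect)
                bad++;
            if (mul_un8((uint8_t)x, (uint8_t)a) != mul_un8_ref((uint8_t)x, (uint8_t)a))
                bad++;
        }
    }
    return bad;
}
```

Calling `count_packed_mismatches()` from a small `main` is enough to check the identity on any machine, without MIPS hardware.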

But modern processors may support some nice instructions which can do
this job even better. It makes sense to revert C optimizations and
look at the original ((x * a + (255 / 2)) / 255) formula again.

MIPS DSP ASE has a special instruction MULQ_RS.W for rounded fixed
point Q31 multiplication (((int64_t)a * b + (1 << 30)) >> 31) and it
can be used quite conveniently here because we get a shift and
rounding for free. We just need to use the Q31 representation of 1/255
for multiplication by reciprocal. MIPS DSP ASE also supports Q15 fixed
point multiplication and it could have been even nicer, but Q15
precision is apparently insufficient for getting bit exact results in
this case.
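
To make the Q31 versus Q15 remark a bit more concrete, here is a portable C model of both variants. It is only a sketch: `mul_un8_q15` is my own model of a rounded Q15 multiply by an approximation of 1/255 (the constant 128 is an assumption, chosen because 255 * 129 no longer fits in a signed 16-bit operand), not the exact semantics of any MIPS instruction.

```c
#include <stdint.h>

/* Reference: rounded division by 255. */
static int mul_un8_ref(int a, int b)
{
    return (a * b + 127) / 255;
}

/* Model of the rounded Q31 multiply from the test program:
 * b is pre-multiplied by 8421504, which is about 2^31 / 255. */
static int mul_un8_q31(int a, int b)
{
    int64_t recip = (int64_t)b * 8421504;
    return (int)(((int64_t)a * recip + ((int64_t)1 << 30)) >> 31);
}

/* Assumed Q15 analogue: b is pre-multiplied by 128, about 2^15 / 255. */
static int mul_un8_q15(int a, int b)
{
    int32_t recip = b * 128;
    return (a * recip + (1 << 14)) >> 15;
}

/* Hypothetical helper: count (a, b) pairs where f differs from the reference. */
static int count_mismatches(int (*f)(int, int))
{
    int bad = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (f(a, b) != mul_un8_ref(a, b))
                bad++;
    return bad;
}
```

In this model the Q31 variant agrees with the reference for every input pair, while the Q15 variant already misses at a = b = 255, where it yields 254 instead of 255.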

Looking at the original intended formula is always useful, because it
allows trying and benchmarking alternative implementations of the same
calculations, selecting the best one for the target hardware. Maybe
MIPS r5g6b5 -> x8r8g8b8 pixel format conversion could also borrow some
ideas from the other discussion thread:
http://lists.freedesktop.org/archives/pixman/2012-May/001958.html

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

2012-05-20 Thread Siarhei Siamashka
On Wed, May 16, 2012 at 12:37 AM, Siarhei Siamashka
siarhei.siamas...@gmail.com wrote:
 On Mon, May 14, 2012 at 9:17 PM, Nemanja Lukic nemanja.lu...@rt-rk.com 
 wrote:
 Is this small improvement worth making this code vulnerable to endian issues?

 If you are already satisfied with this level of performance, then it's
 probably fine for now.

By the way, I really mean it :) In my opinion, it is generally enough
that the patches are useful for something and do not cause
regressions. If implementing additional performance tweaks may take
too much time, then they can be added later. But it is also important
to realize that there is still some room for improvement and not to
drop the optimization work half-way.

Also maybe you have noticed that pixman-0.26.0 is about to be released
next week:
http://lists.freedesktop.org/archives/pixman/2012-May/001969.html
We still need to either fix the bug which causes the test suite
failure for MIPS DSP ASE, or at least disable the problematic
optimizations for this stable release.

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

2012-05-15 Thread Lukic, Nemanja
 \alpha,\alpha, \red
 precr.qb.ph \scratch1, \green, \blue
-precrq.qb.ph\tl,   \alpha, \scratch1
+precrq.qb.ph\top,  \alpha, \scratch1
 .endm
 
 #endif //PIXMAN_MIPS_DSPR2_ASM_H

-Original Message-
From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com] 
Sent: Friday, May 11, 2012 10:55 AM
To: Lukic, Nemanja
Cc: pixman@lists.freedesktop.org; nemanja.lu...@rt-rk.com
Subject: Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888
fast path.

On Thu, May 3, 2012 at 1:03 AM, Nemanja Lukic nlu...@mips.com wrote:
 From: Nemanja Lukic nemanja.lu...@rt-rk.com

 Performance numbers before/after on MIPS-74kc @ 1GHz

 Referent (before):

 cairo-perf-trace:
 [ # ]  backend                         test   min(s) median(s) stddev. count
 [ # ]    image: pixman 0.25.3
 [  0]    image             firefox-fishtank 2289.180 2290.567   0.05%    5/6

 Optimized:

 cairo-perf-trace:
 [ # ]  backend                         test   min(s) median(s) stddev. count
 [ # ]    image: pixman 0.25.3
 [  0]    image             firefox-fishtank 1700.925 1708.314   0.22%    5/6

This definitely is an improvement. But the firefox-fishtank trace is
very dependent on bilinear scaling performance, both x86 SSE2 and ARM
NEON demonstrate more than 3x speedup here:
http://ssvb.github.com/2012/05/04/xorg-drivers-and-software-rendering.html

I understand that MIPS DSPr2 does not stand a chance competing with
128-bit SIMD competitors, but still some more performance tweaks can
probably be applied. See more comments below.

 diff --git a/pixman/pixman-mips-dspr2-asm.h b/pixman/pixman-mips-dspr2-asm.h
 index 8383060..7cf3281 100644
 --- a/pixman/pixman-mips-dspr2-asm.h
 +++ b/pixman/pixman-mips-dspr2-asm.h
 @@ -566,4 +566,60 @@ LEAF_MIPS32R2(symbol)                                   \
     addu_s.qb              \out2_, \d2_,  \scratch2
  .endm

 +.macro BILINEAR_INTERPOLATE_SINGLE_PIXEL tl, tr, bl, br,         \
 +                                         scratch1, scratch2,     \
 +                                         alpha, red, green, blue \
 +                                         wt1, wt2, wb1, wb2
 +    andi            \scratch1, \tl,  0xff
 +    andi            \scratch2, \tr,  0xff
 +    andi            \alpha,    \bl,  0xff
 +    andi            \red,      \br,  0xff

I suggest having a look at
http://lists.freedesktop.org/archives/pixman/2011-February/001088.html

The ANDI/EXT instructions from the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro
could be replaced with byte load instructions. MIPS74K can't dual
issue ALU+ALU instructions, but can dual issue LS+ALU. This looks like
a potentially huge performance win on MIPS74K hardware, far exceeding
the speedup observed on x86.

Why is the faster C bilinear code from my old post still not in
pixman? As I mentioned there, the discussion is still ongoing about
how to improve bilinear scaling performance when SIMD extensions are
not available. Reducing interpolation precision from the current
8-bit to 7-bit allows using signed multiplications and can help
x86 MMX/SSE2/SSSE3 code a lot. It may even make sense to reduce interpolation
precision further to 4-bit as suggested by Taekyun Kim at that time:
http://lists.freedesktop.org/archives/pixman/2011-February/001044.html
This allows halving the number of multiplications for bilinear
interpolation in C code by using SIMD-like tricks.

But both Taekyun Kim and I were mostly interested in ARM NEON
performance, and NEON happens not to suffer from 8-bit interpolation
much. Nobody else has tried pushing interpolation precision reduction
for faster bilinear interpolation into pixman, and it did not
happen. But the hope is not totally lost, see the recent discussion:
http://lists.freedesktop.org/archives/pixman/2012-May/001930.html

Regarding how it affects you: if bilinear interpolation precision gets
changed after all, your optimized code in the bilinear over_8888_8_8888
fast path will need to be updated (if we still care about getting
identical results everywhere and passing the test suite). You may also
want to take part in this activity and evaluate the effects of 8-bit
vs. 7-bit vs. 4-bit interpolation for MIPS.

 +    multu           $ac0,      \wt1, \scratch1
 +    maddu           $ac0,      \wt2, \scratch2
 +    maddu           $ac0,      \wb1, \alpha
 +    maddu           $ac0,      \wb2, \red
 +
 +    ext             \scratch1, \tl,  8, 8
 +    ext             \scratch2, \tr,  8, 8
 +    ext             \alpha,    \bl,  8, 8
 +    ext             \red,      \br,  8, 8
 +
 +    multu           $ac1,      \wt1, \scratch1
 +    maddu           $ac1,      \wt2, \scratch2
 +    maddu           $ac1,      \wb1, \alpha
 +    maddu           $ac1,      \wb2, \red
 +
 +    ext             \scratch1, \tl,  16, 8
 +    ext             \scratch2, \tr,  16, 8
 +    ext             \alpha,    \bl,  16, 8
 +    ext             \red,      \br,  16, 8
 +
 +    mflo

Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

2012-05-15 Thread Siarhei Siamashka
On Mon, May 14, 2012 at 9:17 PM, Nemanja Lukic nemanja.lu...@rt-rk.com wrote:
 Hi Siarhei,

 I implemented a new version of the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro
 (patch below) where the ANDI/EXT instructions are substituted with load
 byte instructions (for better dual-issue instruction balancing) and got
 these results on my Malta board:

 Original:
 [  0]    image             firefox-fishtank 2289.180 2290.567   0.05%    5/6
 Opt (ANDI/EXT)
 [  0]    image             firefox-fishtank 1700.925 1708.314   0.22%    5/6
 Opt2 (load byte instructions)
 [  0]    image             firefox-fishtank 1671.700 1672.006   0.03%    4/4

 There is a performance improvement, but not as impressive as I expected.

This is interesting. Have you tried to check where the time is
actually spent and what the performance bottleneck is?

If the code from the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro is taken and
put into an unrolled loop so that the total number of iterations is equal
to the CPU clock frequency (so the run time in seconds reads directly as
cycles per iteration), then we get:

MIPS 74K and ANDI/EXT variant of BILINEAR_INTERPOLATE_SINGLE_PIXEL:

real 0m40.279s
user 0m40.260s
sys 0m0.006s

MIPS 74K and LBU variant of BILINEAR_INTERPOLATE_SINGLE_PIXEL:

real 0m26.479s
user 0m26.468s
sys 0m0.006s

That's ~40 cycles vs. ~26.5 cycles, a saving of approximately 13.5 cycles
from replacing 16 ALU instructions with 16 LS instructions.
Ideally we would want to have perfect dual issue here and a 16 cycle
saving, but at least dual issue works.

 And now the code also becomes vulnerable to the endianness of the target CPUs.
 Of course, this can be guarded with some #ifdef's where the byte offset in a word
 is changed according to the endianness of the target CPU (since MIPS CPUs can
 be both LE and BE).
 Is this small improvement worth making this code vulnerable to endian issues?

If you are already satisfied with this level of performance, then it's
probably fine for now.
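
For readers following along, the endianness concern can be sketched in plain C. This is only an illustration with made-up helper names: it contrasts extracting a channel from a pixel held in a register (what ANDI/EXT do) with loading the byte directly from memory (what LBU does), where the byte offset depends on the byte order. Real assembly would pick the offset at build time with #ifdef's; the sketch detects it at run time to stay portable.

```c
#include <stdint.h>
#include <string.h>

/* Run-time byte order probe (assembly code would use #ifdef instead). */
static int is_big_endian(void)
{
    const uint32_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 0;
}

/* ALU-style extraction: shift and mask a pixel already in a register.
 * `shift` is the bit position of the channel: 0, 8, 16 or 24. */
static uint8_t channel_by_shift(uint32_t pixel, unsigned shift)
{
    return (uint8_t)((pixel >> shift) & 0xff);
}

/* Load-style extraction: read one byte straight from memory.
 * The byte offset within the word flips with the byte order. */
static uint8_t channel_by_load(const uint32_t *pixel, unsigned shift)
{
    const uint8_t *bytes = (const uint8_t *)pixel;
    unsigned offset = is_big_endian() ? 3 - shift / 8 : shift / 8;
    return bytes[offset];
}
```

Both helpers return the same value for any pixel and channel, but only `channel_by_shift` is independent of the byte order without the extra offset adjustment.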

 I still need to add improvement for that packing/unpacking of the RGBA pixels 
 after bilinear/before OVER operation, but I don't expect big improvement 
 there (it is just a couple of instructions).

It's not just a couple of instructions. By combining the color
channels in a register, you are also forcing the processor to finish
the calculations for all the needed data. And this is an extra data
dependency, which may inhibit instruction reordering.

But big improvements are not likely to happen unless there is a clear
understanding of what is going on in the CPU pipeline and an
accounting of each cycle spent.

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

2012-05-15 Thread Matt Turner
On Tue, May 15, 2012 at 5:37 PM, Siarhei Siamashka
siarhei.siamas...@gmail.com wrote:
 I still need to add improvement for that packing/unpacking of the RGBA 
 pixels after bilinear/before OVER operation, but I don't expect big 
 improvement there (it is just a couple of instructions).

 It's not just a couple of instructions. By combining the color
 channels in a register, you are also forcing the processor to finish
 the calculations for all the needed data. And this is an extra data
 dependency, which may inhibit instruction reordering.

Indeed. For instance this is part of the reason why [1] made such a
large difference.

[1] http://cgit.freedesktop.org/pixman/commit/?id=7d4beedc612a32b73d7673bbf6447de0f3fca298


Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

2012-05-11 Thread Siarhei Siamashka
On Thu, May 3, 2012 at 1:03 AM, Nemanja Lukic nlu...@mips.com wrote:
 From: Nemanja Lukic nemanja.lu...@rt-rk.com

 Performance numbers before/after on MIPS-74kc @ 1GHz

 Referent (before):

 cairo-perf-trace:
 [ # ]  backend                         test   min(s) median(s) stddev. count
 [ # ]    image: pixman 0.25.3
 [  0]    image             firefox-fishtank 2289.180 2290.567   0.05%    5/6

 Optimized:

 cairo-perf-trace:
 [ # ]  backend                         test   min(s) median(s) stddev. count
 [ # ]    image: pixman 0.25.3
 [  0]    image             firefox-fishtank 1700.925 1708.314   0.22%    5/6

This definitely is an improvement. But the firefox-fishtank trace is
very dependent on bilinear scaling performance, both x86 SSE2 and ARM
NEON demonstrate more than 3x speedup here:
http://ssvb.github.com/2012/05/04/xorg-drivers-and-software-rendering.html

I understand that MIPS DSPr2 does not stand a chance competing with
128-bit SIMD competitors, but still some more performance tweaks can
probably be applied. See more comments below.

 diff --git a/pixman/pixman-mips-dspr2-asm.h b/pixman/pixman-mips-dspr2-asm.h
 index 8383060..7cf3281 100644
 --- a/pixman/pixman-mips-dspr2-asm.h
 +++ b/pixman/pixman-mips-dspr2-asm.h
 @@ -566,4 +566,60 @@ LEAF_MIPS32R2(symbol)                                   \
     addu_s.qb              \out2_, \d2_,  \scratch2
  .endm

 +.macro BILINEAR_INTERPOLATE_SINGLE_PIXEL tl, tr, bl, br,         \
 +                                         scratch1, scratch2,     \
 +                                         alpha, red, green, blue \
 +                                         wt1, wt2, wb1, wb2
 +    andi            \scratch1, \tl,  0xff
 +    andi            \scratch2, \tr,  0xff
 +    andi            \alpha,    \bl,  0xff
 +    andi            \red,      \br,  0xff

I suggest having a look at
http://lists.freedesktop.org/archives/pixman/2011-February/001088.html

The ANDI/EXT instructions from the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro
could be replaced with byte load instructions. MIPS74K can't dual
issue ALU+ALU instructions, but can dual issue LS+ALU. This looks like
a potentially huge performance win on MIPS74K hardware, far exceeding
the speedup observed on x86.

Why is the faster C bilinear code from my old post still not in
pixman? As I mentioned there, the discussion is still ongoing about
how to improve bilinear scaling performance when SIMD extensions are
not available. Reducing interpolation precision from the current
8-bit to 7-bit allows using signed multiplications and can help
x86 MMX/SSE2/SSSE3 code a lot. It may even make sense to reduce interpolation
precision further to 4-bit as suggested by Taekyun Kim at that time:
http://lists.freedesktop.org/archives/pixman/2011-February/001044.html
This allows halving the number of multiplications for bilinear
interpolation in C code by using SIMD-like tricks.
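
One way the halving could work, as a sketch only and not necessarily the scheme proposed in the linked post: with 4-bit weights the four combined weights sum to 256, so two 8-bit channels packed as 0x00XX00YY can share a single multiplication without one lane overflowing into the other. All names below are made up for this example.

```c
#include <stdint.h>

/* Combined weights for top-left, top-right, bottom-left, bottom-right.
 * wx and wy are in 0..16; the four weights always sum to 256. */
static void weights_4bit(unsigned wx, unsigned wy, uint32_t w[4])
{
    w[0] = (16 - wx) * (16 - wy);
    w[1] = wx * (16 - wy);
    w[2] = (16 - wx) * wy;
    w[3] = wx * wy;
}

/* Reference: one multiplication per channel per pixel (16 in total). */
static uint32_t bilinear_4bit_ref(const uint32_t px[4], unsigned wx, unsigned wy)
{
    uint32_t w[4], out = 0;
    weights_4bit(wx, wy, w);
    for (unsigned shift = 0; shift < 32; shift += 8) {
        uint32_t acc = 0;
        for (int i = 0; i < 4; i++)
            acc += ((px[i] >> shift) & 0xff) * w[i];
        out |= (acc >> 8) << shift;
    }
    return out;
}

/* Packed: two channels per multiplication (8 in total).
 * Each 16-bit lane is at most 255 * 256 = 65280, so lanes never collide. */
static uint32_t bilinear_4bit_packed(const uint32_t px[4], unsigned wx, unsigned wy)
{
    uint32_t w[4], rb = 0, ag = 0;
    weights_4bit(wx, wy, w);
    for (int i = 0; i < 4; i++) {
        rb += (px[i] & 0x00ff00ff) * w[i];
        ag += ((px[i] >> 8) & 0x00ff00ff) * w[i];
    }
    return ((rb >> 8) & 0x00ff00ff) | (ag & 0xff00ff00);
}

/* Hypothetical helper: compare both versions on pseudo-random pixels. */
static int count_mismatches(void)
{
    uint32_t state = 12345u, px[4];
    int bad = 0;
    for (unsigned wx = 0; wx <= 16; wx++) {
        for (unsigned wy = 0; wy <= 16; wy++) {
            for (int i = 0; i < 4; i++) {
                state = state * 1664525u + 1013904223u;
                px[i] = state;
            }
            if (bilinear_4bit_ref(px, wx, wy) != bilinear_4bit_packed(px, wx, wy))
                bad++;
        }
    }
    return bad;
}
```

The same packing does not work with 8-bit weights, because the per-lane sums would then need more than 16 bits.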

But both Taekyun Kim and I were mostly interested in ARM NEON
performance, and NEON happens not to suffer from 8-bit interpolation
much. Nobody else has tried pushing interpolation precision reduction
for faster bilinear interpolation into pixman, and it did not
happen. But the hope is not totally lost, see the recent discussion:
http://lists.freedesktop.org/archives/pixman/2012-May/001930.html

Regarding how it affects you: if bilinear interpolation precision gets
changed after all, your optimized code in the bilinear over_8888_8_8888
fast path will need to be updated (if we still care about getting
identical results everywhere and passing the test suite). You may also
want to take part in this activity and evaluate the effects of 8-bit
vs. 7-bit vs. 4-bit interpolation for MIPS.

 +    multu           $ac0,      \wt1, \scratch1
 +    maddu           $ac0,      \wt2, \scratch2
 +    maddu           $ac0,      \wb1, \alpha
 +    maddu           $ac0,      \wb2, \red
 +
 +    ext             \scratch1, \tl,  8, 8
 +    ext             \scratch2, \tr,  8, 8
 +    ext             \alpha,    \bl,  8, 8
 +    ext             \red,      \br,  8, 8
 +
 +    multu           $ac1,      \wt1, \scratch1
 +    maddu           $ac1,      \wt2, \scratch2
 +    maddu           $ac1,      \wb1, \alpha
 +    maddu           $ac1,      \wb2, \red
 +
 +    ext             \scratch1, \tl,  16, 8
 +    ext             \scratch2, \tr,  16, 8
 +    ext             \alpha,    \bl,  16, 8
 +    ext             \red,      \br,  16, 8
 +
 +    mflo            \blue,     $ac0
 +
 +    multu           $ac2,      \wt1, \scratch1
 +    maddu           $ac2,      \wt2, \scratch2
 +    maddu           $ac2,      \wb1, \alpha
 +    maddu           $ac2,      \wb2, \red
 +
 +    ext             \scratch1, \tl,  24, 8
 +    ext             \scratch2, \tr,  24, 8
 +    ext             \alpha,    \bl,  24, 8
 +    ext             \red,      \br,  24, 8
 +
 +    mflo            \green,    $ac1
 +
 +    multu           $ac3,      \wt1, \scratch1
 +    maddu           

Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

2012-05-11 Thread Siarhei Siamashka
On Thu, May 3, 2012 at 1:03 AM, Nemanja Lukic nlu...@mips.com wrote:
 From: Nemanja Lukic nemanja.lu...@rt-rk.com

 Performance numbers before/after on MIPS-74kc @ 1GHz

 Referent (before):

 cairo-perf-trace:
 [ # ]  backend                         test   min(s) median(s) stddev. count
 [ # ]    image: pixman 0.25.3
 [  0]    image             firefox-fishtank 2289.180 2290.567   0.05%    5/6

 Optimized:

 cairo-perf-trace:
 [ # ]  backend                         test   min(s) median(s) stddev. count
 [ # ]    image: pixman 0.25.3
 [  0]    image             firefox-fishtank 1700.925 1708.314   0.22%    5/6

BTW, it is also really easy to tweak lowlevel-blt-bench to also
optionally benchmark NEAREST and BILINEAR scaling and get some
synthetic MPix/s statistics for these cases. The only thing which
needs to be done is to simply apply an almost-identity matrix to the
source image. It will prevent pixman from using unscaled fast paths,
while having only minimal effect on the result of actual compositing
operations (so unscaled MPix/s numbers can be even directly compared
with scaled MPix/s for every operation).
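
To give a feeling for how small "almost-identity" can be, here is a self-contained sketch using a stand-in for pixman's 16.16 fixed-point type. The type and helper names are invented for this example. In a real benchmark the matrix would be applied to the source image with pixman_image_set_transform, which is not shown here.

```c
#include <stdint.h>

typedef int32_t fixed_16_16;                 /* stand-in for a 16.16 fixed-point value */
#define FIXED_1 ((fixed_16_16)1 << 16)       /* 1.0 in 16.16 */

/* An exact 1.0 scale factor is what an unscaled fast path would expect. */
static int is_identity_scale(fixed_16_16 scale)
{
    return scale == FIXED_1;
}

/* How far (in pixels) the source x coordinate drifts from the identity
 * mapping at destination column dst_x, ignoring translation. */
static double source_drift(fixed_16_16 scale, int dst_x)
{
    int64_t src   = (int64_t)scale * dst_x;      /* 16.16 source coordinate */
    int64_t exact = (int64_t)FIXED_1 * dst_x;
    return (double)(src - exact) / FIXED_1;
}
```

With a scale factor of FIXED_1 + 1, the matrix is no longer an identity, yet across a 2000 pixel wide scanline the source position drifts by only about 0.03 pixels.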

Anybody willing to submit a patch?

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

2012-05-11 Thread Lukic, Nemanja
Hi Siarhei,

You are right, the 74k core is an out-of-order dual-issue core (LS+ALU).
Using byte load instructions instead of ANDI/EXT is a nice tweak to try, with
a potentially big performance improvement.
I'll benchmark this and upload a new patch (combined with the better-commented
commit for the fix I also pushed for the over_n_8888_8888_ca/over_n_8888_0565_ca
routines).

Combining RGBA pixels at the end of the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro
and later splitting them again in the OVER_8888_8_8888 macro is a consequence of
my intention to use BILINEAR_INTERPOLATE_SINGLE_PIXEL in more bilinear routines,
like bilinear_scanline_8888/0565_8888/0565_SRC/ADD, where pixels, once packed to
RGBA, don't need to be unpacked any more. But maybe something like a macro
parameter which tells whether pixels should be combined into RGBA
could be added to this macro. I'll look into this.

Best Regards,
Nemanja Lukic

-Original Message-
From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com] 
Sent: Friday, May 11, 2012 10:55 AM
To: Nemanja Lukic
Cc: pixman@lists.freedesktop.org; Nemanja Lukic
Subject: Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888
fast path.

On Thu, May 3, 2012 at 1:03 AM, Nemanja Lukic nlu...@mips.com wrote:
 From: Nemanja Lukic nemanja.lu...@rt-rk.com

 Performance numbers before/after on MIPS-74kc @ 1GHz

 Referent (before):

 cairo-perf-trace:
 [ # ]  backend                         test   min(s) median(s) stddev. count
 [ # ]    image: pixman 0.25.3
 [  0]    image             firefox-fishtank 2289.180 2290.567   0.05%    5/6

 Optimized:

 cairo-perf-trace:
 [ # ]  backend                         test   min(s) median(s) stddev. count
 [ # ]    image: pixman 0.25.3
 [  0]    image             firefox-fishtank 1700.925 1708.314   0.22%    5/6

This definitely is an improvement. But the firefox-fishtank trace is
very dependent on bilinear scaling performance, both x86 SSE2 and ARM
NEON demonstrate more than 3x speedup here:
http://ssvb.github.com/2012/05/04/xorg-drivers-and-software-rendering.html

I understand that MIPS DSPr2 does not stand a chance competing with
128-bit SIMD competitors, but still some more performance tweaks can
probably be applied. See more comments below.

 diff --git a/pixman/pixman-mips-dspr2-asm.h b/pixman/pixman-mips-dspr2-asm.h
 index 8383060..7cf3281 100644
 --- a/pixman/pixman-mips-dspr2-asm.h
 +++ b/pixman/pixman-mips-dspr2-asm.h
 @@ -566,4 +566,60 @@ LEAF_MIPS32R2(symbol)                                   \
     addu_s.qb              \out2_, \d2_,  \scratch2
  .endm

 +.macro BILINEAR_INTERPOLATE_SINGLE_PIXEL tl, tr, bl, br,         \
 +                                         scratch1, scratch2,     \
 +                                         alpha, red, green, blue \
 +                                         wt1, wt2, wb1, wb2
 +    andi            \scratch1, \tl,  0xff
 +    andi            \scratch2, \tr,  0xff
 +    andi            \alpha,    \bl,  0xff
 +    andi            \red,      \br,  0xff

I suggest having a look at
http://lists.freedesktop.org/archives/pixman/2011-February/001088.html

The ANDI/EXT instructions from the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro
could be replaced with byte load instructions. MIPS74K can't dual
issue ALU+ALU instructions, but can dual issue LS+ALU. This looks like
a potentially huge performance win on MIPS74K hardware, far exceeding
the speedup observed on x86.

Why is the faster C bilinear code from my old post still not in
pixman? As I mentioned there, the discussion is still ongoing about
how to improve bilinear scaling performance when SIMD extensions are
not available. Reducing interpolation precision from the current
8-bit to 7-bit allows using signed multiplications and can help
x86 MMX/SSE2/SSSE3 code a lot. It may even make sense to reduce interpolation
precision further to 4-bit as suggested by Taekyun Kim at that time:
http://lists.freedesktop.org/archives/pixman/2011-February/001044.html
This allows halving the number of multiplications for bilinear
interpolation in C code by using SIMD-like tricks.

But both Taekyun Kim and I were mostly interested in ARM NEON
performance, and NEON happens not to suffer from 8-bit interpolation
much. Nobody else has tried pushing interpolation precision reduction
for faster bilinear interpolation into pixman, and it did not
happen. But the hope is not totally lost, see the recent discussion:
http://lists.freedesktop.org/archives/pixman/2012-May/001930.html

Regarding how it affects you: if bilinear interpolation precision gets
changed after all, your optimized code in the bilinear over_8888_8_8888
fast path will need to be updated (if we still care about getting
identical results everywhere and passing the test suite). You may also
want to take part in this activity and evaluate the effects of 8-bit
vs. 7-bit vs. 4-bit interpolation for MIPS.

 +    multu           $ac0,      \wt1, \scratch1
 +    maddu