Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.
On Wed, May 23, 2012 at 9:13 PM, Søren Sandmann sandm...@cs.au.dk wrote:
> Lukic, Nemanja nlu...@mips.com writes:
>> I added more explanation in the commit message for that commit.
>
> Thanks, I have pushed it to master (with some minor formatting changes) along with the bilinear optimization patch.

That's good news. To be on the safe side, I have also verified that the pixman tests pass with git master 30816e3068bccf7c78c78f916b54971d24873bdc.

It looks like the MIPS DSPr2 part is ready for the stable pixman release. It would also be interesting (but of course not strictly necessary) to get the MIPS DSPr2 performance improvement numbers summarised in one table, similar to the one presented in:
http://mattst88.com/blog/2012/05/17/Optimizing_pixman_for_Loongson:_Process_and_Results/

I can try to run the full set of cairo-perf-trace benchmarks, but my gigabit router with a MIPS 74K processor only has 128MiB of RAM, and some of the traces may run out of memory. It may also take an eternity to run to completion :)

--
Best regards,
Siarhei Siamashka
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.
Hi Soren,

I added more explanation in the commit message for that commit. Sorry for the late reply. If you think that anything is missing in the commit message, please tell me, and I'll update it.

Thanks,
Nemanja Lukic

-Original Message-
From: pixman-bounces+nlukic=mips@lists.freedesktop.org [mailto:pixman-bounces+nlukic=mips@lists.freedesktop.org] On Behalf Of Søren Sandmann
Sent: Wednesday, May 23, 2012 6:51 AM
To: Siarhei Siamashka
Cc: pixman@lists.freedesktop.org; nemanja.lu...@rt-rk.com
Subject: Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

Siarhei Siamashka siarhei.siamas...@gmail.com writes:
> On Wed, May 16, 2012 at 12:37 AM, Siarhei Siamashka siarhei.siamas...@gmail.com wrote:
>> On Mon, May 14, 2012 at 9:17 PM, Nemanja Lukic nemanja.lu...@rt-rk.com wrote:
>>> Is this small improvement worth making this code vulnerable to endian issues?
>>
>> If you are already satisfied with this level of performance, then it's probably fine for now.
>
> By the way, I really mean it :) In my opinion, it is generally enough that the patches are useful for something and do not cause regressions. If implementing additional performance tweaks may take too much time, then they can be added later. But it is also important to realize that there is still some room for improvement and not to drop the optimization work half-way.
>
> Also maybe you have noticed that pixman-0.26.0 is about to be released next week:
> http://lists.freedesktop.org/archives/pixman/2012-May/001969.html
>
> We still need to either fix the bug which causes the test suite failure for MIPS DSP ASE, or at least disable the problematic optimizations for this stable release.

Yeah, unless someone who understands the fix here:
http://lists.freedesktop.org/archives/pixman/2012-May/001932.html
comes up with a commit message, I'll just revert the optimization before releasing.

Søren
Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.
Hi Siarhei,

Your comments on the bilinear commit certainly deserve more investigation, and you made a strong point that these optimizations should be revisited. But for now, these tweaks need a lot of time to implement, and may affect more than just the bilinear commit. Since I plan to push more fast paths (both bilinear and SRC/OVER/ADD) for MIPS DSP, I think we should use this bilinear patch as it is for now (for this pixman release), since it shows a good performance increase and no regressions. I will certainly come back with a new commit that includes the tweaks you suggested and improves the existing commit(s).

Thanks,
Nemanja Lukic

-Original Message-
From: pixman-bounces+nlukic=mips@lists.freedesktop.org [mailto:pixman-bounces+nlukic=mips@lists.freedesktop.org] On Behalf Of Siarhei Siamashka
Sent: Sunday, May 20, 2012 11:31 PM
To: nemanja.lu...@rt-rk.com
Cc: pixman@lists.freedesktop.org
Subject: Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

On Wed, May 16, 2012 at 12:37 AM, Siarhei Siamashka siarhei.siamas...@gmail.com wrote:
> On Mon, May 14, 2012 at 9:17 PM, Nemanja Lukic nemanja.lu...@rt-rk.com wrote:
>> Is this small improvement worth making this code vulnerable to endian issues?
>
> If you are already satisfied with this level of performance, then it's probably fine for now.

By the way, I really mean it :) In my opinion, it is generally enough that the patches are useful for something and do not cause regressions. If implementing additional performance tweaks may take too much time, then they can be added later. But it is also important to realize that there is still some room for improvement and not to drop the optimization work half-way.

Also maybe you have noticed that pixman-0.26.0 is about to be released next week:
http://lists.freedesktop.org/archives/pixman/2012-May/001969.html

We still need to either fix the bug which causes the test suite failure for MIPS DSP ASE, or at least disable the problematic optimizations for this stable release.

--
Best regards,
Siarhei Siamashka
Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.
Lukic, Nemanja nlu...@mips.com writes:
> I added more explanation in the commit message for that commit.

Thanks, I have pushed it to master (with some minor formatting changes) along with the bilinear optimization patch.

Thanks,
Søren
Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.
On Wed, May 16, 2012 at 5:27 AM, Siarhei Siamashka siarhei.siamas...@gmail.com wrote:
>> +/*
>> + * Multiply pixel (a8) with single pixel (a8r8g8b8). It requires maskLSR
>> + * needed for rounding process. maskLSR must have following value:
>> + *   li maskLSR, 0x00ff00ff
>> + */
>> +.macro MIPS_UN8x4_MUL_UN8 s_, \
>> +                          m_8, \
>> +                          d_, \
>> +                          maskLSR, \
>> +                          scratch1, scratch2, scratch3
>> +    replv.ph        \m_8, \m_8                       /* 0 | M | 0 | M */
>> +    muleu_s.ph.qbl  \scratch1, \s_, \m_8             /* A*M | R*M */
>> +    muleu_s.ph.qbr  \scratch2, \s_, \m_8             /* G*M | B*M */
>> +    shra_r.ph       \scratch3, \scratch1, 8
>> +    shra_r.ph       \d_, \scratch2, 8
>> +    and             \scratch3, \scratch3, \maskLSR   /* 0 |A*M| 0 |R*M */
>> +    and             \d_, \d_, \maskLSR               /* 0 |G*M| 0 |B*M */
>> +    addq.ph         \scratch1, \scratch1, \scratch3  /* A*M+A*M | R*M+R*M */
>> +    addq.ph         \scratch2, \scratch2, \d_        /* G*M+G*M | B*M+B*M */
>> +    shra_r.ph       \scratch1, \scratch1, 8
>> +    shra_r.ph       \scratch2, \scratch2, 8
>> +    precr.qb.ph     \d_, \scratch1, \scratch2
>> +.endm
>
> A possible alternative is to just use a single MULQ_RS.W instruction for each color component. That's 5 instructions in total, because the 8-bit alpha value from the mask needs to be premultiplied by 8421504. A test program is listed below:
>
> /***/
> #include <stdio.h>
> #include <stdint.h>
>
> int mul_un8(int a, int b)
> {
> #if 1
>     int t = a * b + 0x80;
>     return (t + (t >> 8)) >> 8;
> #else
>     return (a * b + 127) / 255;
> #endif
> }
>
> int mul_un8_mips(int a, int b)
> {
>     int c;
>     b *= 8421504;
> #if 1
>     asm ("mulq_rs.w %0, %1, %2" : "=r" (c) : "r" (a), "r" (b));
> #else
>     c = ((int64_t)a * b + (1 << 30)) >> 31;
> #endif
>     return c;
> }
>
> int main()
> {
>     int a, b;
>     for (a = 0; a < 256; a++) {
>         for (b = 0; b < 256; b++) {
>             if (mul_un8(a, b) != mul_un8_mips(a, b)) {
>                 printf("test failed! a=%d b=%d\n", a, b);
>                 return 1;
>             }
>         }
>     }
>     printf("test passed\n");
>     return 0;
> }
> /***/
>
> There is only one problem with the MULQ_RS.W instruction: it seems to have a huge ~14 (!) cycle latency. But the throughput is OK (1 cycle per instruction). So in order to hide this latency, some serious loop unrolling may be needed (possibly up to handling 4 pixels at once).
> The other MIPS DSP ASE fast path functions may also try to use MULQ_RS.W

Maybe a bit more explanation would be useful. When multiplying an 8-bit color component by an 8-bit alpha for the OVER operator, pixman actually wants to do:

    x' = (x * a + (255 / 2)) / 255;

This is division by 255 with the result rounded to the nearest integer. Because integer division is slow, the C implementation replaces it with shifts and additions:
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine.h.template?id=pixman-0.24.4#n27

    t = x * a + 0x80;
    x' = (t + (t >> 8)) >> 8;

This method of calculation is also good because the intermediate results fit in unsigned 16-bit variables. That in turn allows a SIMD-alike trick to process two color components at once on 32-bit systems:
http://cgit.freedesktop.org/pixman/tree/pixman/pixman-combine.h.template?id=pixman-0.24.4#n40

    /* xy = 0x00AB00CD, where AB is one color component, CD - another */
    t = xy * a + 0x00800080;
    xy' = ((t + ((t >> 8) & 0x00FF00FF)) >> 8) & 0x00FF00FF;

But modern processors may support some nice instructions which can do this job even better. It makes sense to revert the C optimizations and look at the original ((x * a + (255 / 2)) / 255) formula again. MIPS DSP ASE has a special instruction, MULQ_RS.W, for rounded fixed point Q31 multiplication (((int64_t)a * b + (1 << 30)) >> 31), and it can be used quite conveniently here because we get a shift and rounding for free. We just need to use the Q31 representation of 1/255 for multiplication by the reciprocal. MIPS DSP ASE also supports Q15 fixed point multiplication, which could have been even nicer, but Q15 precision is apparently insufficient for getting bit-exact results in this case.

Looking at the original intended formula is always useful, because it allows trying and benchmarking alternative implementations of the same calculation and selecting the best one for the target hardware.

Maybe the MIPS r5g6b5 -> x8r8g8b8 pixel format conversion could also borrow some ideas from the other discussion thread:
http://lists.freedesktop.org/archives/pixman/2012-May/001958.html

--
Best regards,
Siarhei Siamashka
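As an illustration of the shift-and-add formulas discussed in this message, here is a minimal C sketch that compares the scalar form and the packed two-component form against the exact rounded division by 255. The function names are made up for this example and are not pixman's own macros; it is only meant to show why the intermediates never spill from one 16-bit lane into the other.

```c
#include <assert.h>
#include <stdint.h>

/* Exact reference: divide by 255 and round to the nearest integer. */
uint8_t mul_un8_ref(uint8_t x, uint8_t a)
{
    return (uint8_t)((x * a + 127) / 255);
}

/* Shift-and-add form; t is at most 255*255 + 0x80, so it fits in 16 bits. */
uint8_t mul_un8_shift(uint8_t x, uint8_t a)
{
    uint32_t t = (uint32_t)x * a + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Packed form: xy holds two components as 0x00AB00CD. Each 16-bit lane
 * stays below 0x10000 at every step, so no carry crosses between lanes. */
uint32_t mul_un8x2(uint32_t xy, uint8_t a)
{
    uint32_t t = xy * a + 0x00800080;
    return ((t + ((t >> 8) & 0x00FF00FF)) >> 8) & 0x00FF00FF;
}

/* Exhaustively compare both variants with the reference (hypothetical
 * helper for this sketch, aborts via assert on the first mismatch). */
void check_mul_un8_variants(void)
{
    for (int a = 0; a < 256; a++) {
        for (int hi = 0; hi < 256; hi++) {
            assert(mul_un8_shift((uint8_t)hi, (uint8_t)a) ==
                   mul_un8_ref((uint8_t)hi, (uint8_t)a));
            for (int lo = 0; lo < 256; lo++) {
                uint32_t packed = ((uint32_t)hi << 16) | (uint32_t)lo;
                uint32_t expect =
                    ((uint32_t)mul_un8_ref((uint8_t)hi, (uint8_t)a) << 16) |
                    mul_un8_ref((uint8_t)lo, (uint8_t)a);
                assert(mul_un8x2(packed, (uint8_t)a) == expect);
            }
        }
    }
}
```

Calling check_mul_un8_variants() from a small main() walks all 256 x 256 x 256 input combinations, which only takes a moment on a desktop machine.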
Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.
On Wed, May 16, 2012 at 12:37 AM, Siarhei Siamashka siarhei.siamas...@gmail.com wrote:
> On Mon, May 14, 2012 at 9:17 PM, Nemanja Lukic nemanja.lu...@rt-rk.com wrote:
>> Is this small improvement worth making this code vulnerable to endian issues?
>
> If you are already satisfied with this level of performance, then it's probably fine for now.

By the way, I really mean it :) In my opinion, it is generally enough that the patches are useful for something and do not cause regressions. If implementing additional performance tweaks may take too much time, then they can be added later. But it is also important to realize that there is still some room for improvement and not to drop the optimization work half-way.

Also maybe you have noticed that pixman-0.26.0 is about to be released next week:
http://lists.freedesktop.org/archives/pixman/2012-May/001969.html

We still need to either fix the bug which causes the test suite failure for MIPS DSP ASE, or at least disable the problematic optimizations for this stable release.

--
Best regards,
Siarhei Siamashka
Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.
\alpha,\alpha, \red
 precr.qb.ph \scratch1, \green, \blue
-precrq.qb.ph\tl, \alpha, \scratch1
+precrq.qb.ph\top, \alpha, \scratch1
 .endm

 #endif //PIXMAN_MIPS_DSPR2_ASM_H

-Original Message-
From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com]
Sent: Friday, May 11, 2012 10:55 AM
To: Lukic, Nemanja
Cc: pixman@lists.freedesktop.org; nemanja.lu...@rt-rk.com
Subject: Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

On Thu, May 3, 2012 at 1:03 AM, Nemanja Lukic nlu...@mips.com wrote:
> From: Nemanja Lukic nemanja.lu...@rt-rk.com
>
> Performance numbers before/after on MIPS-74kc @ 1GHz
>
> Referent (before):
>
> cairo-perf-trace:
> [ # ]  backend                         test   min(s) median(s) stddev. count
> [ # ]    image: pixman 0.25.3
> [  0]    image             firefox-fishtank 2289.180 2290.567   0.05%    5/6
>
> Optimized:
>
> cairo-perf-trace:
> [ # ]  backend                         test   min(s) median(s) stddev. count
> [ # ]    image: pixman 0.25.3
> [  0]    image             firefox-fishtank 1700.925 1708.314   0.22%    5/6

This definitely is an improvement. But the firefox-fishtank trace is very dependent on bilinear scaling performance; both x86 SSE2 and ARM NEON demonstrate more than 3x speedup here:
http://ssvb.github.com/2012/05/04/xorg-drivers-and-software-rendering.html

I understand that MIPS DSPr2 does not stand a chance competing with 128-bit SIMD competitors, but some more performance tweaks can probably still be applied. See more comments below.
Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.
On Mon, May 14, 2012 at 9:17 PM, Nemanja Lukic nemanja.lu...@rt-rk.com wrote:
> Hi Siarhei,
>
> I implemented a new version (patch below) of the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro where the ANDI/EXT instructions are substituted with load byte instructions (for better dual-issue instruction balancing) and got these results on my Malta board:
>
> Original:
> [  0]    image             firefox-fishtank 2289.180 2290.567   0.05%    5/6
> Opt (ANDI/EXT):
> [  0]    image             firefox-fishtank 1700.925 1708.314   0.22%    5/6
> Opt2 (load byte instructions):
> [  0]    image             firefox-fishtank 1671.700 1672.006   0.03%    4/4
>
> There is a performance improvement, but not as impressive as I expected.

This is interesting. Have you tried to check where the time is actually spent and what the performance bottleneck is? If the code from the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro is put into an unrolled loop so that the total number of iterations is equal to the CPU clock frequency (so the run time in seconds equals the number of cycles per iteration), then we get:

MIPS 74K and ANDI/EXT variant of BILINEAR_INTERPOLATE_SINGLE_PIXEL:
real    0m40.279s
user    0m40.260s
sys     0m0.006s

MIPS 74K and LBU variant of BILINEAR_INTERPOLATE_SINGLE_PIXEL:
real    0m26.479s
user    0m26.468s
sys     0m0.006s

That's ~40 cycles vs. ~26.5 cycles, a saving of approximately 13.5 cycles from replacing 16 ALU instructions with 16 LS instructions. Ideally we would want perfect dual issue here and a 16-cycle saving, but at least dual issue works.

> And now the code also becomes vulnerable to the endianness of the target CPUs. Of course, this can be guarded with some #ifdef's where the byte offset in a word is changed according to the endianness of the target CPU (since MIPS CPUs can be both LE and BE). Is this small improvement worth making this code vulnerable to endian issues?

If you are already satisfied with this level of performance, then it's probably fine for now.

> I still need to add an improvement for the packing/unpacking of the RGBA pixels after bilinear/before the OVER operation, but I don't expect a big improvement there (it is just a couple of instructions).

It's not just a couple of instructions. By combining the color channels in a register, you are also forcing the processor to finish the calculations for all the needed data. And this is an extra data dependency, which may inhibit instruction reordering.

But big improvements are not likely to happen unless there is a clear understanding of what is going on in the CPU pipeline and every spent cycle is accounted for.

--
Best regards,
Siarhei Siamashka
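To make the endianness concern from this exchange concrete, here is a small C sketch of how the byte offset of a color channel inside a 32-bit a8r8g8b8 pixel depends on byte order. It uses a runtime probe only so that it runs anywhere; assembly code such as the macro under discussion would more likely pick the offsets at build time with #ifdef's, as suggested above. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    uint8_t first_byte;
    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

/* Memory offset (0..3) of the channel that sits at bit position
 * `shift` (0, 8, 16 or 24) once the pixel is loaded into a register. */
unsigned channel_byte_offset(unsigned shift)
{
    unsigned le_offset;
    assert(shift == 0 || shift == 8 || shift == 16 || shift == 24);
    le_offset = shift / 8;
    return host_is_little_endian() ? le_offset : 3 - le_offset;
}

/* ALU-style extraction, the counterpart of ANDI/EXT on a loaded word. */
uint8_t channel_by_shift(const uint32_t *pixel, unsigned shift)
{
    return (uint8_t)((*pixel >> shift) & 0xff);
}

/* Load-style extraction, the counterpart of a byte load (LBU). */
uint8_t channel_by_byte_load(const uint32_t *pixel, unsigned shift)
{
    const uint8_t *bytes = (const uint8_t *)pixel;
    return bytes[channel_byte_offset(shift)];
}
```

Both extraction functions return the same channel on either byte order, and the only endian-dependent piece is the offset table. That is the part an #ifdef would have to cover.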
Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.
On Tue, May 15, 2012 at 5:37 PM, Siarhei Siamashka siarhei.siamas...@gmail.com wrote:
>> I still need to add improvement for that packing/unpacking of the RGBA pixels after bilinear/before OVER operation, but I don't expect big improvement there (it is just a couple of instructions).
>
> It's not just a couple of instructions. By combining the color channels in a register, you are also forcing the processor to finish the calculations for all the needed data. And this is an extra data dependency, which may inhibit instructions reordering.

Indeed. For instance, this is part of the reason why [1] made such a large difference.

[1] http://cgit.freedesktop.org/pixman/commit/?id=7d4beedc612a32b73d7673bbf6447de0f3fca298
Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.
On Thu, May 3, 2012 at 1:03 AM, Nemanja Lukic nlu...@mips.com wrote:
> From: Nemanja Lukic nemanja.lu...@rt-rk.com
>
> Performance numbers before/after on MIPS-74kc @ 1GHz
>
> Referent (before):
>
> cairo-perf-trace:
> [ # ]  backend                         test   min(s) median(s) stddev. count
> [ # ]    image: pixman 0.25.3
> [  0]    image             firefox-fishtank 2289.180 2290.567   0.05%    5/6
>
> Optimized:
>
> cairo-perf-trace:
> [ # ]  backend                         test   min(s) median(s) stddev. count
> [ # ]    image: pixman 0.25.3
> [  0]    image             firefox-fishtank 1700.925 1708.314   0.22%    5/6

This definitely is an improvement. But the firefox-fishtank trace is very dependent on bilinear scaling performance; both x86 SSE2 and ARM NEON demonstrate more than 3x speedup here:
http://ssvb.github.com/2012/05/04/xorg-drivers-and-software-rendering.html

I understand that MIPS DSPr2 does not stand a chance competing with 128-bit SIMD competitors, but some more performance tweaks can probably still be applied. See more comments below.

> diff --git a/pixman/pixman-mips-dspr2-asm.h b/pixman/pixman-mips-dspr2-asm.h
> index 8383060..7cf3281 100644
> --- a/pixman/pixman-mips-dspr2-asm.h
> +++ b/pixman/pixman-mips-dspr2-asm.h
> @@ -566,4 +566,60 @@ LEAF_MIPS32R2(symbol) \
>  addu_s.qb \out2_, \d2_, \scratch2
>  .endm
>
> +.macro BILINEAR_INTERPOLATE_SINGLE_PIXEL tl, tr, bl, br, \
> +                                         scratch1, scratch2, \
> +                                         alpha, red, green, blue \
> +                                         wt1, wt2, wb1, wb2
> +    andi \scratch1, \tl, 0xff
> +    andi \scratch2, \tr, 0xff
> +    andi \alpha, \bl, 0xff
> +    andi \red, \br, 0xff

I suggest having a look at
http://lists.freedesktop.org/archives/pixman/2011-February/001088.html

The ANDI/EXT instructions in the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro could be replaced with byte load instructions. MIPS74K can't dual issue ALU+ALU instructions, but can dual issue LS+ALU. This looks like a potentially huge performance win on MIPS74K hardware, far exceeding the speedup observed on x86.

Why is the faster C bilinear code from my old post still not in pixman? As I mentioned there, the discussion is still ongoing about how to improve bilinear scaling performance when SIMD extensions are not available. Reducing interpolation precision from the current 8-bit to 7-bit allows the use of signed multiplications and can help x86 MMX/SSE2/SSSE3 code a lot. It may even make sense to reduce interpolation precision further to 4-bit, as suggested by Taekyun Kim at that time:
http://lists.freedesktop.org/archives/pixman/2011-February/001044.html

This makes it possible to halve the number of multiplications for bilinear interpolation in C code by using SIMD-alike tricks. But both Taekyun Kim and I were mostly interested in ARM NEON performance, and NEON happens not to suffer much from 8-bit interpolation. Nobody else has tried pushing an interpolation precision reduction into pixman for faster bilinear interpolation, so it did not happen. But hope is not totally lost, see the recent discussion:
http://lists.freedesktop.org/archives/pixman/2012-May/001930.html

Regarding how it affects you: if bilinear interpolation precision gets changed after all, your optimized code in the bilinear over_8888_8_8888 fast path will need to be updated (if we still care about getting identical results everywhere and passing the test suite). You may also want to take part in this activity and evaluate the effects of 8-bit vs. 7-bit vs. 4-bit interpolation for MIPS.

> +    multu $ac0, \wt1, \scratch1
> +    maddu $ac0, \wt2, \scratch2
> +    maddu $ac0, \wb1, \alpha
> +    maddu $ac0, \wb2, \red
> +
> +    ext \scratch1, \tl, 8, 8
> +    ext \scratch2, \tr, 8, 8
> +    ext \alpha, \bl, 8, 8
> +    ext \red, \br, 8, 8
> +
> +    multu $ac1, \wt1, \scratch1
> +    maddu $ac1, \wt2, \scratch2
> +    maddu $ac1, \wb1, \alpha
> +    maddu $ac1, \wb2, \red
> +
> +    ext \scratch1, \tl, 16, 8
> +    ext \scratch2, \tr, 16, 8
> +    ext \alpha, \bl, 16, 8
> +    ext \red, \br, 16, 8
> +
> +    mflo \blue, $ac0
> +
> +    multu $ac2, \wt1, \scratch1
> +    maddu $ac2, \wt2, \scratch2
> +    maddu $ac2, \wb1, \alpha
> +    maddu $ac2, \wb2, \red
> +
> +    ext \scratch1, \tl, 24, 8
> +    ext \scratch2, \tr, 24, 8
> +    ext \alpha, \bl, 24, 8
> +    ext \red, \br, 24, 8
> +
> +    mflo \green, $ac1
> +
> +    multu $ac3, \wt1, \scratch1
> +    maddu
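For readers following the multu/maddu sequences in the quoted macro, the C sketch below shows the per-channel weighted sum they appear to compute: four neighbouring pixels, four weights, one accumulator per channel. The weight derivation (8-bit distx/disty, weights summing to 65536, result shifted right by 16) is an assumption based on the 8-bit interpolation precision mentioned in this message, not something taken from the patch itself, and the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Bilinear interpolation of one a8r8g8b8 pixel with 8-bit weights.
 * distx/disty are the fractional positions (0..255) between the
 * left/right and top/bottom neighbours. Weight names follow the macro
 * arguments (wt1, wt2, wb1, wb2), but how the real code derives them
 * is assumed here. */
uint32_t bilinear_interpolate_8bit(uint32_t tl, uint32_t tr,
                                   uint32_t bl, uint32_t br,
                                   unsigned distx, unsigned disty)
{
    uint32_t wt1, wt2, wb1, wb2, result = 0;
    unsigned shift;

    assert(distx < 256 && disty < 256);

    wt1 = (256 - distx) * (256 - disty);  /* top-left     */
    wt2 = distx * (256 - disty);          /* top-right    */
    wb1 = (256 - distx) * disty;          /* bottom-left  */
    wb2 = distx * disty;                  /* bottom-right */
    /* The four weights always sum to 256 * 256 = 65536. */

    for (shift = 0; shift < 32; shift += 8) {
        /* One multiply plus three multiply-accumulates per channel,
         * like multu followed by three maddu on one accumulator. */
        uint32_t acc = wt1 * ((tl >> shift) & 0xff)
                     + wt2 * ((tr >> shift) & 0xff)
                     + wb1 * ((bl >> shift) & 0xff)
                     + wb2 * ((br >> shift) & 0xff);
        result |= (acc >> 16) << shift;
    }
    return result;
}
```

With this weighting the accumulator never exceeds 65536 * 255, so a 32-bit accumulator per channel is sufficient. The packing step at the end of the loop is the point where all four channel results have to be available.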
Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.
On Thu, May 3, 2012 at 1:03 AM, Nemanja Lukic nlu...@mips.com wrote:
> From: Nemanja Lukic nemanja.lu...@rt-rk.com
>
> Performance numbers before/after on MIPS-74kc @ 1GHz
>
> Referent (before):
>
> cairo-perf-trace:
> [ # ]  backend                         test   min(s) median(s) stddev. count
> [ # ]    image: pixman 0.25.3
> [  0]    image             firefox-fishtank 2289.180 2290.567   0.05%    5/6
>
> Optimized:
>
> cairo-perf-trace:
> [ # ]  backend                         test   min(s) median(s) stddev. count
> [ # ]    image: pixman 0.25.3
> [  0]    image             firefox-fishtank 1700.925 1708.314   0.22%    5/6

BTW, it is also really easy to tweak lowlevel-blt-bench to optionally benchmark NEAREST and BILINEAR scaling and get some synthetic MPix/s statistics for these cases. The only thing that needs to be done is to apply an almost-identity matrix to the source image. It will prevent pixman from using unscaled fast paths, while having only a minimal effect on the result of the actual compositing operations (so unscaled MPix/s numbers can even be directly compared with scaled MPix/s for every operation).

Anybody willing to submit a patch?

--
Best regards,
Siarhei Siamashka
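A rough C sketch of what an "almost-identity matrix" could look like in 16.16 fixed point follows. The types and functions here are stand-ins invented for this example and are not the pixman API, and the chosen deviation of one fixed-point unit (1/65536) in the x scale is an assumption. Whether that is enough to steer pixman away from its unscaled fast paths would have to be checked against the real fast path selection code.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_16_16;              /* stand-in for a 16.16 fixed type */
#define FIXED_1 ((fixed_16_16)65536)

typedef struct { fixed_16_16 m[3][3]; } transform_sketch;  /* hypothetical */

void init_identity(transform_sketch *t)
{
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++)
            t->m[r][c] = (r == c) ? FIXED_1 : 0;
}

/* Identity, except that the x scale is off by 1/65536 (assumed epsilon). */
void init_almost_identity(transform_sketch *t)
{
    init_identity(t);
    t->m[0][0] = FIXED_1 + 1;
}

int is_identity(const transform_sketch *t)
{
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++)
            if (t->m[r][c] != ((r == c) ? FIXED_1 : 0))
                return 0;
    return 1;
}

/* Source x coordinate (16.16) sampled for the centre of destination
 * pixel x, using only the x scale and x translation entries. */
fixed_16_16 map_x(const transform_sketch *t, int x)
{
    int64_t centre;
    assert(x >= 0 && x < 32000);
    centre = ((int64_t)x << 16) + FIXED_1 / 2;
    return (fixed_16_16)((((int64_t)t->m[0][0] * centre) >> 16) + t->m[0][2]);
}
```

With this epsilon, the sampling position for destination pixel 2000 drifts by 2000/65536 of a pixel (about 0.03) relative to the identity transform, which gives a feel for how small the effect on the composited result would be.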
Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.
Hi Siarhei,

You are right, the 74k core is an out-of-order dual-issue core (LS+ALU). Using byte load instructions instead of ANDI/EXT is a nice tweak to try, with a potentially big performance improvement. I'll benchmark this and upload a new patch (combined with a better-commented commit for the fix I also pushed for the over_n_8888_8888_ca/over_n_8888_0565_ca routines).

Combining RGBA pixels at the end of the BILINEAR_INTERPOLATE_SINGLE_PIXEL macro and later splitting them again in the OVER_8888_8_8888 macro is a consequence of my intention to use BILINEAR_INTERPOLATE_SINGLE_PIXEL in more bilinear routines, like bilinear_scanline_8888/0565_8888/0565_SRC/ADD, where pixels, once packed to RGBA, don't need to be unpacked any more. But maybe a macro parameter that tells whether the pixels should be combined into RGBA could be added to this macro. I'll look into this.

Best Regards,
Nemanja Lukic

-Original Message-
From: Siarhei Siamashka [mailto:siarhei.siamas...@gmail.com]
Sent: Friday, May 11, 2012 10:55 AM
To: Nemanja Lukic
Cc: pixman@lists.freedesktop.org; Nemanja Lukic
Subject: Re: [Pixman] [PATCH 2/2] MIPS: DSPr2: Added bilinear over_8888_8_8888 fast path.

On Thu, May 3, 2012 at 1:03 AM, Nemanja Lukic nlu...@mips.com wrote:
> From: Nemanja Lukic nemanja.lu...@rt-rk.com
>
> Performance numbers before/after on MIPS-74kc @ 1GHz
>
> Referent (before):
>
> cairo-perf-trace:
> [ # ]  backend                         test   min(s) median(s) stddev. count
> [ # ]    image: pixman 0.25.3
> [  0]    image             firefox-fishtank 2289.180 2290.567   0.05%    5/6
>
> Optimized:
>
> cairo-perf-trace:
> [ # ]  backend                         test   min(s) median(s) stddev. count
> [ # ]    image: pixman 0.25.3
> [  0]    image             firefox-fishtank 1700.925 1708.314   0.22%    5/6

This definitely is an improvement.