Re: [Pixman] [PATCH 1/4] bits: Implement PAD support in the simple fetcher
On 2013-01-30, at 9:14 PM, Søren Sandmann wrote:

> Obviously this means that in this case, we're not getting the benefits
> of any platform-specific fast paths. Perhaps what we need is a pad
> equivalent of fast_composite_tiled_repeat()?

> This might be a good idea, though it's still worth finding out what is
> causing an untransformed image with PAD repeat to be drawn without a
> mask onto a destination bigger than the source image. Either the
> resulting stripes are desirable for some reason, or Firefox should be
> more careful about clipping to the source image.

If I had to guess, I'd guess that this is unintentional. There's been a bunch of churn over the semantics of drawImage with a source rect larger than the image. At one time, the correct behaviour was to use EXTEND_PAD; however, I believe the currently spec'd behaviour is that we should be clamping the source rect to the size of the image. I'm not sure we've switched to this yet, but I'll take a look tomorrow.

-Jeff
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
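To make the two behaviours discussed above concrete, here is a minimal sketch in C. It is not pixman or Firefox code: `fetch_scanline_pad` and `clamp_span_to_image` are hypothetical helpers that work on a single row of 32-bit pixels. The first replicates the edge pixels (PAD semantics, which produces the stripes mentioned above); the second clamps the requested span to the image instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: clamp a coordinate into [0, size - 1]. */
static int
clamp_coord (int v, int size)
{
    if (v < 0)
        return 0;
    if (v >= size)
        return size - 1;
    return v;
}

/* PAD semantics: positions outside the row replicate the nearest edge
 * pixel, which is what produces stripes beyond the image. */
static void
fetch_scanline_pad (const uint32_t *row, int width,
                    int x, int count, uint32_t *out)
{
    int i;

    for (i = 0; i < count; i++)
        out[i] = row[clamp_coord (x + i, width)];
}

/* Alternative: clamp the requested span [x, x + count) to the image.
 * Returns 0 if nothing is left to draw. */
static int
clamp_span_to_image (int *x, int *count, int width)
{
    int x0 = *x < 0 ? 0 : *x;
    int x1 = *x + *count > width ? width : *x + *count;

    if (x1 <= x0)
        return 0;
    *x = x0;
    *count = x1 - x0;
    return 1;
}

/* Usage: a 3-pixel row requested from x = -2 for 7 pixels. */
static void
example_pad_versus_clamp (void)
{
    const uint32_t row[3] = { 10, 20, 30 };
    uint32_t out[7];
    int x = -2, count = 7, ok;

    fetch_scanline_pad (row, 3, x, count, out);
    assert (out[0] == 10 && out[2] == 10 && out[3] == 20 && out[6] == 30);

    ok = clamp_span_to_image (&x, &count, 3);
    assert (ok && x == 0 && count == 3);
}
```

With PAD the whole 7-pixel span is filled; with clamping only the 3 pixels that overlap the image would be drawn.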
Re: [Pixman] [PATCH] Add a version of bilinear_interpolation for precision <=4
On 2013-01-28, at 4:28 PM, Siarhei Siamashka wrote:

> On Thu, 24 Jan 2013 15:35:24 -0500
> Jeff Muizelaar jmuizel...@mozilla.com wrote:
>
>> Having 4 or fewer bits means we can do two components at a time in a
>> single 32 bit register.
>>
>> Here are the results for firefox-fishtank on a Pandaboard with 4.6.3
>> and PIXMAN_DISABLE=arm-neon
>>
>> Before:
>> [ # ]  backend  test                min(s)  median(s)  stddev.  count
>> [  0]  image    t-firefox-fishtank  7.841   7.910      0.70%    6/6
>>
>> After:
>> [ # ]  backend  test                min(s)  median(s)  stddev.  count
>> [  0]  image    t-firefox-fishtank  6.951   6.995      1.11%    6/6
>
> That's pretty cool, thanks.
>
> I just wonder if you might also be interested in a 4-bit bilinear
> scaling for the SSE2 backend? It could provide quite a significant
> performance improvement.

Yes, definitely interested. We would ship 4-bit precision on desktop if it was noticeably faster.

-Jeff
Re: [Pixman] 0.29.2
On 2013-01-27, at 2:43 PM, Siarhei Siamashka wrote:

> I was a bit worried about the projective transforms. The accuracy is
> improved, but the new code uses a rather naive and slow implementation
> for long division. So the performance is going to be worse. But as
> almost nobody seems to be using projective transforms (not cairo at
> least) and the current implementation itself can hardly be considered
> optimized at all, this should not be a big problem.

FWIW, we currently use pixman's projective transforms in Firefox to implement CSS 3D transforms on platforms without hardware acceleration. Making it slower isn't great.

-Jeff
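For readers wondering why division speed matters here: a projective transform needs one division per transformed point, while an affine transform needs none. The sketch below illustrates that under simplifying assumptions. It uses `double` arithmetic rather than pixman's fixed-point types, and `matrix3` and `project_point` are hypothetical names, not pixman API.

```c
#include <assert.h>

/* Hypothetical 3x3 matrix type; pixman itself uses fixed-point values. */
typedef struct
{
    double m[3][3];
} matrix3;

/* Map (x, y) through t. For an affine matrix the bottom row is (0, 0, 1),
 * so w is always 1 and the division is trivial. For a projective matrix
 * w varies per point, so every point pays for a division.
 * Returns 0 if the point maps to infinity (w == 0). */
static int
project_point (const matrix3 *t, double x, double y,
               double *out_x, double *out_y)
{
    double px = t->m[0][0] * x + t->m[0][1] * y + t->m[0][2];
    double py = t->m[1][0] * x + t->m[1][1] * y + t->m[1][2];
    double w  = t->m[2][0] * x + t->m[2][1] * y + t->m[2][2];

    if (w == 0.0)
        return 0;

    *out_x = px / w;
    *out_y = py / w;
    return 1;
}

/* Usage: the bottom row (0.5, 0, 1) makes w depend on x. */
static void
example_projective (void)
{
    matrix3 t = {{{ 1, 0, 0 }, { 0, 1, 0 }, { 0.5, 0, 1 }}};
    double ox = 0.0, oy = 0.0;
    int ok = project_point (&t, 2.0, 4.0, &ox, &oy);

    /* w = 0.5 * 2 + 1 = 2, so (2, 4) maps to (1, 2). */
    assert (ok && ox == 1.0 && oy == 2.0);
}
```

When this runs once per destination pixel, the cost of the division implementation shows up directly in compositing time.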
[Pixman] [PATCH] Add a version of bilinear_interpolation for precision <=4
Having 4 or fewer bits means we can do two components at a time in a single 32 bit register.

Here are the results for firefox-fishtank on a Pandaboard with 4.6.3 and PIXMAN_DISABLE=arm-neon

Before:
[ # ]  backend  test                min(s)  median(s)  stddev.  count
[  0]  image    t-firefox-fishtank  7.841   7.910      0.70%    6/6

After:
[ # ]  backend  test                min(s)  median(s)  stddev.  count
[  0]  image    t-firefox-fishtank  6.951   6.995      1.11%    6/6
---
 pixman/pixman-inlines.h | 37 +
 1 file changed, 37 insertions(+)

diff --git a/pixman/pixman-inlines.h b/pixman/pixman-inlines.h
index ab4def0..a03d967 100644
--- a/pixman/pixman-inlines.h
+++ b/pixman/pixman-inlines.h
@@ -88,6 +88,42 @@ pixman_fixed_to_bilinear_weight (pixman_fixed_t x)
 	   ((1 << BILINEAR_INTERPOLATION_BITS) - 1);
 }
 
+#if BILINEAR_INTERPOLATION_BITS <= 4
+/* Inspired by Filter_32_opaque from Skia */
+static force_inline uint32_t
+bilinear_interpolation(uint32_t tl, uint32_t tr,
+		       uint32_t bl, uint32_t br,
+		       int distx, int disty)
+{
+    int distxy, distxiy, distixy, distixiy;
+    uint32_t lo, hi;
+
+    distx <<= (4 - BILINEAR_INTERPOLATION_BITS);
+    disty <<= (4 - BILINEAR_INTERPOLATION_BITS);
+
+    distxy = distx * disty;
+    distxiy = (distx << 4) - distxy; /* distx * (16 - disty) */
+    distixy = (disty << 4) - distxy; /* disty * (16 - distx) */
+    distixiy =
+	16 * 16 - (disty << 4) -
+	(distx << 4) + distxy; /* (16 - distx) * (16 - disty) */
+
+    lo = (tl & 0xff00ff) * distixiy;
+    hi = ((tl >> 8) & 0xff00ff) * distixiy;
+
+    lo += (tr & 0xff00ff) * distxiy;
+    hi += ((tr >> 8) & 0xff00ff) * distxiy;
+
+    lo += (bl & 0xff00ff) * distixy;
+    hi += ((bl >> 8) & 0xff00ff) * distixy;
+
+    lo += (br & 0xff00ff) * distxy;
+    hi += ((br >> 8) & 0xff00ff) * distxy;
+
+    return ((lo >> 8) & 0xff00ff) | (hi & ~0xff00ff);
+}
+
+#else
 #if SIZEOF_LONG > 4
 
 static force_inline uint32_t
@@ -184,6 +220,7 @@ bilinear_interpolation (uint32_t tl, uint32_t tr,
 }
 
 #endif
+#endif // BILINEAR_INTERPOLATION_BITS <= 4
 
 /*
  * For each scanline fetched from source image with PAD repeat:
-- 
1.8.0.19.gf369db1
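The reason 4 bits is the limit can be checked directly: with 4-bit weights the four weights always sum to 256, so each 8-bit channel's weighted sum is at most 255 * 256 = 65280 and fits in 16 bits. Two channels can therefore share one 32-bit register without one carrying into the other. The sketch below is a standalone check of that claim, not part of the patch. It assumes the weights are already in the 0..15 range (so the initial shift is omitted), and the names `bilinear_packed_4bit`, `bilinear_per_channel_4bit` and `check_packed_matches_reference` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Packed variant: channels 0 and 2 travel in `lo`, channels 1 and 3 in
 * `hi`, each with 16 bits of headroom. Weights are assumed to be 0..15. */
static uint32_t
bilinear_packed_4bit (uint32_t tl, uint32_t tr,
                      uint32_t bl, uint32_t br,
                      int distx, int disty)
{
    int distxy   = distx * disty;                          /* weight of br */
    int distxiy  = (distx << 4) - distxy;                  /* weight of tr */
    int distixy  = (disty << 4) - distxy;                  /* weight of bl */
    int distixiy = 256 - (disty << 4) - (distx << 4) + distxy; /* of tl */
    uint32_t lo, hi;

    lo  = (tl & 0xff00ff) * distixiy;
    hi  = ((tl >> 8) & 0xff00ff) * distixiy;
    lo += (tr & 0xff00ff) * distxiy;
    hi += ((tr >> 8) & 0xff00ff) * distxiy;
    lo += (bl & 0xff00ff) * distixy;
    hi += ((bl >> 8) & 0xff00ff) * distixy;
    lo += (br & 0xff00ff) * distxy;
    hi += ((br >> 8) & 0xff00ff) * distxy;

    return ((lo >> 8) & 0xff00ff) | (hi & 0xff00ff00);
}

/* Reference: the same weighted sum, one 8-bit channel at a time. */
static uint32_t
bilinear_per_channel_4bit (uint32_t tl, uint32_t tr,
                           uint32_t bl, uint32_t br,
                           int distx, int disty)
{
    uint32_t wtl = (uint32_t) ((16 - distx) * (16 - disty));
    uint32_t wtr = (uint32_t) (distx * (16 - disty));
    uint32_t wbl = (uint32_t) ((16 - distx) * disty);
    uint32_t wbr = (uint32_t) (distx * disty);
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8)
    {
        uint32_t c = ((tl >> shift) & 0xff) * wtl
                   + ((tr >> shift) & 0xff) * wtr
                   + ((bl >> shift) & 0xff) * wbl
                   + ((br >> shift) & 0xff) * wbr;

        result |= (c >> 8) << shift;    /* c <= 65280, so c >> 8 <= 255 */
    }
    return result;
}

/* Compare both variants over every weight pair and some pseudo-random
 * pixels (simple LCG, fixed seed, so the run is deterministic). */
static void
check_packed_matches_reference (void)
{
    uint32_t seed = 12345;
    int distx, disty, i, k;

    for (distx = 0; distx < 16; distx++)
        for (disty = 0; disty < 16; disty++)
            for (i = 0; i < 8; i++)
            {
                uint32_t p[4];

                for (k = 0; k < 4; k++)
                {
                    seed = seed * 1664525u + 1013904223u;
                    p[k] = seed;
                }
                assert (bilinear_packed_4bit (p[0], p[1], p[2], p[3],
                                              distx, disty) ==
                        bilinear_per_channel_4bit (p[0], p[1], p[2], p[3],
                                                   distx, disty));
            }
}
```

The packed form does 8 multiplications per pixel where the per-channel form does 16, which is where the saving on a 32-bit core without SIMD would come from.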
[Pixman] [PATCH] Add scaled nearest repeat fast paths
Siarhei wrote this patch and we've been using it in the Mozilla tree since May. Before this patch it was often faster to scale and repeat in two passes, because each pass used a fast path, whereas the single-pass approach took the slow path. With this patch the single-pass approach has competitive performance.

patch
Description: Binary data
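As an illustration of what "scale and repeat in a single pass" means, here is a minimal scanline sketch. It is not the attached patch: it assumes the repeat in question is tiling (wrap-around) repeat, a non-negative step, and 16.16 fixed-point source coordinates, and it ignores pixman's exact rounding rules. `scaled_nearest_repeat_scanline` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Fill dst_width destination pixels by stepping through the source row in
 * 16.16 fixed point and wrapping around at the end of the row, so scaling
 * and tiling happen in the same loop with no intermediate buffer.
 * Assumes src_width > 0 and unit_x >= 0. */
static void
scaled_nearest_repeat_scanline (const uint32_t *src, int src_width,
                                uint32_t *dst, int dst_width,
                                int64_t vx, int64_t unit_x)
{
    const int64_t max_vx = (int64_t) src_width << 16;
    int i;

    /* Bring the start position into [0, max_vx). */
    vx %= max_vx;
    if (vx < 0)
        vx += max_vx;

    for (i = 0; i < dst_width; i++)
    {
        dst[i] = src[vx >> 16];      /* nearest: drop the fraction */
        vx += unit_x;
        while (vx >= max_vx)         /* tiling repeat: wrap around */
            vx -= max_vx;
    }
}

/* Usage: a 3-pixel row scaled up 2x (step 0.5) across 8 pixels. */
static void
example_scaled_repeat (void)
{
    const uint32_t src[3] = { 1, 2, 3 };
    const uint32_t expected[8] = { 1, 1, 2, 2, 3, 3, 1, 1 };
    uint32_t dst[8];
    int i;

    scaled_nearest_repeat_scanline (src, 3, dst, 8, 0, 0x8000);
    for (i = 0; i < 8; i++)
        assert (dst[i] == expected[i]);
}
```

The two-pass alternative would first write a scaled copy of the source and then tile that copy, which costs an extra buffer and an extra trip through memory.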
[Pixman] [PATCH] Add fast paths for bilinear scaling
This patch adds fast paths for bilinear scaling of (SRC, r5g6b5, r5g6b5), (OVER, a8r8g8b8, r5g6b5), and (OVER, a8r8g8b8, a8r8g8b8). These make a noticeable improvement in the performance of Firefox on Android.

-Jeff
Re: [Pixman] [PATCH] Add fast paths for bilinear scaling
And here's the patch.

patch
Description: Binary data
Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1
On 2012-05-09, at 12:57 PM, Søren Sandmann wrote:

> Matt Turner matts...@gmail.com writes:
>
>> I started porting my src__0565 MMX function to SSE2, and in the
>> process started thinking about using SSE3+. The useful instructions
>> added post SSE2 that I see are
>>
>> SSE3:
>> lddqu - for unaligned loads across cache lines
>
> I don't really understand that instruction. Isn't it identical to
> movdqu? Or is the idea that lddqu is faster than movdqu for cache line
> splits, but slower for plain old, non-cache split unaligned loads?

"The instructions movdqu, movups, movupd and lddqu are all able to read unaligned vectors. lddqu is faster than the alternatives on P4E and PM processors, but requires the SSE3 instruction set. The unaligned read instructions are relatively slow on older processors, but faster on Nehalem, Sandy Bridge and on future AMD and Intel processors."

From http://www.agner.org/optimize/optimizing_assembly.pdf

-Jeff
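For reference, both loads are reachable from C through compiler intrinsics: `_mm_loadu_si128` (SSE2, typically compiled to movdqu) and `_mm_lddqu_si128` (SSE3, lddqu). They return the same 16 bytes; any difference is in speed on particular processors. The sketch below is only an illustration of picking between them at compile time; `copy16_unaligned` is a hypothetical helper, and the `__SSE3__` / `__SSE2__` macros are the ones GCC and Clang define, so other compilers would take the `memcpy` fallback.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE3__)
#include <pmmintrin.h>   /* _mm_lddqu_si128; also pulls in SSE2 */
#elif defined(__SSE2__)
#include <emmintrin.h>   /* _mm_loadu_si128 */
#endif

/* Hypothetical helper: copy 16 bytes from a possibly unaligned address.
 * All three branches produce the same bytes in dst. */
static void
copy16_unaligned (const void *src, void *dst)
{
#if defined(__SSE3__)
    __m128i v = _mm_lddqu_si128 ((const __m128i *) src);
    _mm_storeu_si128 ((__m128i *) dst, v);
#elif defined(__SSE2__)
    __m128i v = _mm_loadu_si128 ((const __m128i *) src);
    _mm_storeu_si128 ((__m128i *) dst, v);
#else
    memcpy (dst, src, 16);   /* portable fallback without SSE */
#endif
}

/* Usage: read 16 bytes starting one byte past the start of a buffer. */
static void
example_unaligned_copy (void)
{
    uint8_t buf[32];
    uint8_t out[16];
    int i;

    for (i = 0; i < 32; i++)
        buf[i] = (uint8_t) i;

    copy16_unaligned (buf + 1, out);
    assert (memcmp (out, buf + 1, 16) == 0);
}
```

Building with `-msse3` on GCC or Clang selects the lddqu branch; whether that is actually faster depends on the processor, as the quoted passage says.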