Re: [Pixman] More MIPS OVER fast paths (over_8888_n_8888, over_8888_n_0565, over_0565_n_0565, over_8888_8_8888, over_8888_8_0565, over_0565_8_0565, over_8888_8888 and over_8888_8888_8888) including OV
Hi Soren, Siarhei,

Here are the results measured for this OVER combiner on a couple of OVER fast paths:

Before adding the OVER combiner:

over__           = L1:  95.65 L2:  70.26 M: 13.95 ( 74.24%) HT: 16.56 VT: 15.96 R: 14.90 RT:  9.05 ( 53Kops/s)
over__8_         = L1:  13.62 L2:  11.22 M:  7.57 ( 80.53%) HT:  6.24 VT:  6.19 R:  6.13 RT:  3.93 ( 30Kops/s)
over__8_0565     = L1:   7.37 L2:   8.30 M:  6.24 ( 58.08%) HT:  5.46 VT:  5.38 R:  5.26 RT:  3.35 ( 27Kops/s)
over_0565_8_     = L1:  10.56 L2:   9.32 M:  7.13 ( 66.42%) HT:  5.83 VT:  5.79 R:  5.74 RT:  3.60 ( 28Kops/s)
over_0565_8_0565 = L1:   7.82 L2:   7.20 M:  6.09 ( 48.62%) HT:  5.11 VT:  5.07 R:  4.93 RT:  3.13 ( 26Kops/s)

After:

over__           = L1: 163.64 L2:  83.68 M: 17.67 ( 94.15%) HT: 17.09 VT: 16.60 R: 15.31 RT:  9.60 ( 55Kops/s)
over__8_         = L1:  25.98 L2:  22.50 M: 11.54 (122.95%) HT:  9.94 VT:  9.63 R:  9.20 RT:  5.80 ( 38Kops/s)
over__8_0565     = L1:  14.00 L2:  12.45 M:  8.77 ( 81.79%) HT:  6.99 VT:  6.89 R:  6.72 RT:  3.95 ( 30Kops/s)
over_0565_8_     = L1:  16.75 L2:  14.82 M: 10.06 ( 93.83%) HT:  7.98 VT:  7.79 R:  7.48 RT:  4.22 ( 31Kops/s)
over_0565_8_0565 = L1:  10.76 L2:   9.69 M:  7.86 ( 62.79%) HT:  6.18 VT:  6.11 R:  5.97 RT:  3.48 ( 28Kops/s)

Thanks,
Nemanja Lukic

-----Original Message-----
From: Søren Sandmann [mailto:sandm...@cs.au.dk]
Sent: Tuesday, September 25, 2012 6:23 AM
To: Lukic, Nemanja
Cc: pixman@lists.freedesktop.org
Subject: Re: [Pixman] More MIPS OVER fast paths (over_8888_n_8888, over_8888_n_0565, over_0565_n_0565, over_8888_8_8888, over_8888_8_0565, over_0565_8_0565, over_8888_8888 and over_8888_8888_8888) including OVER combiner.

Nemanja Lukic nlu...@mips.com writes:

> Added optimizations for several OVER fast paths:
> - over_8888_n_8888
> - over_8888_n_0565
> - over_0565_n_0565
> - over_8888_8_8888
> - over_8888_8_0565
> - over_0565_8_0565
> - over_8888_8888
> - over_8888_8888_8888
> Including OVER combiner.
>
> Per previous code review:
> - Previously pushed single big commit is now divided into 4 smaller pieces.

Thanks for the patches. I have pushed them to master with a few formatting fixes.

However, you should get a freedesktop account so that you can push patches yourself, or at least, if you want me to merge them, provide a public git repository that can be pulled from.

> - Added OVER combiner.

Did you do any measurements of this one? As Siarhei said:

> As for the performance numbers. I wonder how much faster would these new specialized MIPS fast paths be if we had a DSPr2 optimized OVER combiner? You can check sse2_combine_over_u and neon_combine_over_u functions as examples of existing combiners.

Søren
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
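
For readers unfamiliar with what an OVER combiner computes, here is a minimal scalar sketch in Python. It assumes premultiplied a8r8g8b8 pixels packed into 32-bit integers and applies the usual Porter-Duff OVER rule, dest = src + dest * (255 - alpha(src)) / 255, per channel. It is not the DSPr2 code from these patches, nor the code of sse2_combine_over_u or neon_combine_over_u; the names mul_un8, over_pixel and combine_over_row are hypothetical and chosen only for illustration.

```python
def mul_un8(a: int, b: int) -> int:
    """Multiply two 8-bit values read as fractions of 255, rounding to nearest."""
    t = a * b + 0x80
    return ((t >> 8) + t) >> 8


def over_pixel(src: int, dst: int) -> int:
    """Composite one premultiplied a8r8g8b8 source pixel OVER a destination pixel."""
    inv_alpha = 255 - (src >> 24)  # 255 - source alpha
    out = 0
    for shift in (24, 16, 8, 0):  # A, R, G, B channels
        s = (src >> shift) & 0xFF
        d = (dst >> shift) & 0xFF
        # The clamp only matters for input that is not properly premultiplied.
        c = min(s + mul_un8(d, inv_alpha), 255)
        out |= c << shift
    return out


def combine_over_row(dst_row, src_row):
    """Apply OVER pixel by pixel and return the new destination row (hypothetical helper)."""
    return [over_pixel(s, d) for s, d in zip(src_row, dst_row)]


# Half-transparent red over opaque blue.
print(hex(over_pixel(0x80800000, 0xFF0000FF)))
```

An opaque source pixel replaces the destination and a fully transparent one leaves it unchanged, which is why optimized combiners commonly special-case those two alpha values.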
Re: [Pixman] Questionable numbers from lowlevel-blt-bench
On Mon, Oct 1, 2012 at 1:17 AM, Jonathan Morton jonathan.mor...@movial.com wrote:
> On Sun, 30 Sep 2012 15:05:18 -0700, Matt Turner matts...@gmail.com wrote:
>> In doing performance work, I've noticed some weird results from lowlevel-blt-bench. Often it has seemed that the RT results determined the Kops/s almost entirely. I came across an instance of this today which was particularly striking:
>>
>> Before:
>> add__ = L1:  47.01 L2:  36.84 M: 18.96 ( 33.14%) HT: 35.94 VT: 33.82 R: 30.64 RT: 19.36 ( 181Kops/s)
>> After:
>> add__ = L1: 230.78 L2: 200.86 M: 90.48 (159.44%) HT: 48.41 VT: 45.46 R: 42.78 RT: 19.22 ( 181Kops/s)
>>
>> L1/L2/M numbers are improved by ~5x. HT, VT, and R numbers are improved by ~1.35x. RT doesn't change... neither does Kops/s.
>>
>> What's going on here, and can we make the composite result more sensible?
>
> The figures in brackets are derived directly from one or more of the other figures. In this case, the Kops/s number is derived directly from the RT number, which should explain why they correlate.

Ahh. At least I (and I'm pretty sure others too) thought that the Kops number was supposed to be a composite of HT, VT, RT, and R. This explains it then.

> The percentage figure, meanwhile, represents a percentage of memory bandwidth used by this blitter (under the M test), the peak bandwidth being derived from an earlier measurement. (You're seeing more than 100%, which suggests that the earlier measurement is not optimal.)

Indeed. I'm prefetching in the modified function.

> The RT figure is meant to measure, as directly as possible, the per-call overhead which does not depend on the number of pixels involved. Accordingly, it is not expected to change significantly when doing pixel-related optimisations.

Right, makes sense.

Thanks!
Matt
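
The relationships described in this thread can be checked against the two result lines quoted above. The following is a rough Python sketch, not part of lowlevel-blt-bench: the helper parse_bench_line and the regular expressions are hypothetical, and the "implied reference" value is simply M divided by the bracketed percentage, on the assumption that the percentage is M relative to a fixed reference figure as Jonathan describes.

```python
import re

# Result lines copied from the message above (test name left as it appears there).
BEFORE = "add__ = L1: 47.01 L2: 36.84 M: 18.96 ( 33.14%) HT: 35.94 VT: 33.82 R: 30.64 RT: 19.36 ( 181Kops/s)"
AFTER = "add__ = L1: 230.78 L2: 200.86 M: 90.48 (159.44%) HT: 48.41 VT: 45.46 R: 42.78 RT: 19.22 ( 181Kops/s)"

FIELD_RE = re.compile(r"\b(L1|L2|M|HT|VT|RT|R):\s*([\d.]+)")
PERCENT_RE = re.compile(r"\(\s*([\d.]+)%\)")
KOPS_RE = re.compile(r"\(\s*(\d+)Kops/s\)")


def parse_bench_line(line):
    """Split one result line into its numeric columns (hypothetical helper)."""
    result = {name: float(value) for name, value in FIELD_RE.findall(line)}
    result["percent"] = float(PERCENT_RE.search(line).group(1))
    result["kops"] = int(KOPS_RE.search(line).group(1))
    return result


before = parse_bench_line(BEFORE)
after = parse_bench_line(AFTER)

# Speedup per column: large for L1/L2/M, modest for HT/VT/R, flat for RT.
ratios = {k: after[k] / before[k] for k in ("L1", "L2", "M", "HT", "VT", "R", "RT")}
for name, ratio in ratios.items():
    print(f"{name:>2}: {ratio:.2f}x")

# Kops/s tracks RT, so it stays put when RT does.
print("Kops/s before/after:", before["kops"], after["kops"])

# If the percentage is M relative to a fixed reference, both runs should imply
# roughly the same reference value (in the same units as M).
implied = [run["M"] / (run["percent"] / 100.0) for run in (before, after)]
print("implied reference:", [round(v, 2) for v in implied])
```

Both lines imply nearly the same reference value, which fits the explanation that a figure above 100% reflects a pessimistic earlier bandwidth measurement and not an error in the M result.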