On Mon, 6 Aug 2012 07:21:54 +0200 Nemanja Lukic <nlu...@mips.com> wrote:

>Performance numbers before/after on MIPS-74kc @ 1GHz:
>
>lowlevel-blt-bench results

Hi, thanks for the patch. Just as with the previous patch, the summary
line is way too long. That is also an indirect indication that the patch
would be better split into a few independent parts (for better bisecting
if nothing else).

>Referent (before):
> over_8888_n_8888 = L1: 9.92 L2: 11.27 M: 8.50 ( 45.23%)
> over_8888_n_0565 = L1: 8.95 L2: 8.33 M: 6.95 ( 27.74%)
> over_0565_n_0565 = L1: 7.56 L2: 7.24 M: 6.16 ( 16.38%)
> over_8888_8_8888 = L1: 12.54 L2: 10.86 M: 8.18 ( 54.36%)
> over_8888_8_0565 = L1: 8.86 L2: 8.11 M: 6.72 ( 35.71%)
> over_0565_8_0565 = L1: 7.43 L2: 7.05 M: 5.98 ( 23.85%)
>
>Optimized:
> over_8888_n_8888 = L1: 28.02 L2: 24.92 M: 14.72 ( 78.15%)
> over_8888_n_0565 = L1: 18.76 L2: 17.55 M: 13.11 ( 52.19%)
> over_0565_n_0565 = L1: 15.47 L2: 14.52 M: 12.30 ( 32.65%)
> over_8888_8_8888 = L1: 26.92 L2: 23.93 M: 13.65 ( 90.58%)
> over_8888_8_0565 = L1: 18.14 L2: 16.79 M: 12.10 ( 64.25%)
> over_0565_8_0565 = L1: 15.47 L2: 14.61 M: 11.78 ( 46.92%)

As for the performance numbers: I wonder how much faster these new
specialized MIPS fast paths would be if we had a DSPr2 optimized OVER
combiner. You can check the "sse2_combine_over_u" and
"neon_combine_over_u" functions as examples of existing combiners.

Adding many fast path functions does not scale very well. It increases
code size, but covers only a small fraction of the possible compositing
operations. Adding just a single combiner function improves performance
for nearly all uses of the OVER operator, albeit to a smaller extent,
and only where a C fast path is not taken instead of the general path.

Moreover, I still think that it makes a lot of sense to first implement
the code which provides the best performance for the OVER operator, and
only then replicate it to multiple fast path functions. This means that
implementing the OVER combiner and optimizing the hell out of it would
be a really good start.

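To make it concrete what such a combiner has to compute, here is a minimal
plain C sketch of unmasked OVER on premultiplied a8r8g8b8 pixels. It is only
an illustration, not pixman's actual code: the names mul_un8, over_pixel and
combine_over_u_sketch are hypothetical, and the real combiners also take
implementation, operator and mask arguments that are left out here. A DSPr2
version would replace the per-channel loop with packed arithmetic.

```c
#include <stdint.h>

/* Multiply two 8-bit values treated as fractions of 255, with rounding.
 * (Hypothetical helper name, used only in this sketch.) */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* OVER for one premultiplied a8r8g8b8 pixel:
 *   result = src + dst * (255 - src_alpha) / 255, per channel. */
static uint32_t over_pixel(uint32_t src, uint32_t dst)
{
    uint8_t ia = (uint8_t)(255 - (src >> 24)); /* inverse source alpha */
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xff;
        uint32_t d = (dst >> shift) & 0xff;
        uint32_t c = s + mul_un8((uint8_t)d, ia);

        if (c > 255)
            c = 255; /* saturate; only matters for non-premultiplied input */
        result |= c << shift;
    }
    return result;
}

/* Combine a whole scanline: dest = src OVER dest (no mask). */
static void combine_over_u_sketch(uint32_t *dest, const uint32_t *src, int width)
{
    int i;

    for (i = 0; i < width; i++)
        dest[i] = over_pixel(src[i], dest[i]);
}
```

Because the general path fetches any source and destination format into
a8r8g8b8 scanlines before combining, a single routine of this shape is
exercised by almost every OVER operation that does not hit a fast path.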
The same applies to bilinear scaling: focusing on the best performing
bilinear src_8888_8888 should provide a good starting point for a decent
generic bilinear fetcher. But it's a bit more complicated (separable
processing may need some high level changes). In any case, I'm going to
try implementing fast bilinear scaling for ARM11 (Raspberry Pi), and
parts of it may also be useful for MIPS.

Moving forward, if you split your patch and it passes the tests (BTW, is
your system little endian?), then it should be fine for now. But it is
likely not very future proof, for the reasons explained above.

-- 
Best regards,
Siarhei Siamashka
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman