On Tue, Aug 25, 2015 at 5:02 PM, Ben Avison <bavi...@riscosopen.org> wrote:
> On Tue, 25 Aug 2015 13:45:48 +0100, Oded Gabbay <oded.gab...@gmail.com>
> wrote:
>>>
>>> [exposing general_composite_rect]
>>> I can't say that any cleaner solution has occurred to me since then.
>>
>>
>> I think the more immediate solution, as Soren have suggested on IRC,
>> is for me to implement the equivalent fast-path in VMX.
>> I see that it is already implemented in mmx, sse2, mips-dspr2 and
>> arm-neon. From looking at the C code, I'm guessing that it is fairly
>> simple to implement.
>
>
> Yes, it's definitely one of the simpler fast paths, with only two
> channels to worry about (source and destination) and with one of them
> being a constant. I wrote an arm-simd version as well, to add to your
> list - it's just that it's still waiting to be committed :)
>
> I probably ought to get round to exposing general_composite_rect sooner
> rather than later anyway - it's one of the few things from my mammoth
> patch series last year that Søren commented on and which I haven't got
> round to revising yet.
>
>>> I just had a quick look at the VMX source file, and it has hardly any
>>> iters defined. My guess would be that what's being used is
>>>
>>> noop_init_solid_narrow() from pixman-noop.c
>>> _pixman_iter_get_scanline_noop() from pixman-utils.c
>>> combine_src_u() from pixman-combine32.c
>>>
>> I run perf on lowlevel-blt-bench over_n_8888 and what I got is:
>>
>> - 48.71% 48.68% lowlevel-blt-be lowlevel-blt-bench [.]
>> vmx_combine_over_u_no_mask
>> - vmx_combine_over_u_no_mask
>
>
> Sorry, my mistake - for some reason I must have thought we were dealing
> with src_n_8888 rather than over_n_8888. If you can beat the C version
> using a solid fetcher (which fills a temporary buffer the size of the row
> with a constant pixel value) and an optimised OVER combiner, then you
> should be able to do better still if you cut out the temporary buffer and
> keep the solid colour in registers.
>
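
To make the two routes described above concrete, here is a minimal scalar
C sketch. It is not the pixman or VMX code: the function names are made up
for illustration, and it assumes premultiplied a8r8g8b8 pixels with the
usual rounded x * a / 255 multiply. The first routine fills a temporary
row and then runs an OVER combiner over it; the second keeps the solid
colour in a local variable and skips the temporary buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply one 8-bit channel by an 8-bit factor: rounded x * a / 255. */
static uint8_t mul_un8(uint8_t x, uint8_t a)
{
    uint32_t t = (uint32_t)x * a + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* OVER for one premultiplied a8r8g8b8 pixel:
 * result = src + dst * (255 - src_alpha) / 255, per channel. */
static uint32_t over_pixel(uint32_t src, uint32_t dst)
{
    uint8_t ia = (uint8_t)(255 - (src >> 24));
    uint32_t out = 0;

    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xff;
        uint32_t d = mul_un8((uint8_t)((dst >> shift) & 0xff), ia);
        uint32_t c = s + d;
        if (c > 255)
            c = 255; /* saturate, in case the input was not premultiplied */
        out |= c << shift;
    }
    return out;
}

/* Generic route (hypothetical name): a solid fetcher fills a temporary
 * row, then a combiner applies OVER between that row and the destination. */
static void over_n_8888_via_buffer(uint32_t src, uint32_t *dst,
                                   uint32_t *tmp, size_t width)
{
    for (size_t i = 0; i < width; i++)
        tmp[i] = src;
    for (size_t i = 0; i < width; i++)
        dst[i] = over_pixel(tmp[i], dst[i]);
}

/* Fast-path route (hypothetical name): no temporary buffer, the solid
 * colour stays in a local variable for the whole row. */
static void over_n_8888_direct(uint32_t src, uint32_t *dst, size_t width)
{
    if (src == 0)
        return; /* fully transparent source leaves the destination as-is */
    for (size_t i = 0; i < width; i++)
        dst[i] = over_pixel(src, dst[i]);
}
```

Both routines give the same pixels; the second only saves the extra pass
over memory for the temporary row. A vector version would do the same
arithmetic on several pixels per iteration.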
I implemented over_n_8888 for VMX (adapted from the sse2 version) and ran
lowlevel-blt-bench. I got a degradation in almost all of the benchmarks
(on POWER8, ppc64le):

reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)

            Before      After       Change
L1          572.29      676.6       +18.23%
L2          1038.08     672.68      -35.20%
M           1104.1      682.63      -38.17%
HT          447.45      269.15      -39.85%
VT          520.82      357.1       -31.44%
R           407.92      259.46      -36.39%
RT          148.9       100.25      -32.67%
Kops/s      1100        910         -17.27%

so I'm not inclined to add this slow-path :)

    Oded

>>> Presumably for patch 3 of this series (over_n_0565) you wouldn't see
>>> the same effect, as that can't be achieved using memcpy().
>>
>>
>> Where is that patch ? I didn't see it in the mailing list.
>
>
> My bad again - in my mind, the patches for over_n_8888 and over_n_0565 in
> C and ARMv6 assembly were a group of four and I overlooked the fact that
> when Pekka split them in order to make the benchmarks more robust, he
> only reposted the over_n_8888 ones. My original over_n_0565 patches are
> here:
>
> http://patchwork.freedesktop.org/patch/49902/
> http://patchwork.freedesktop.org/patch/49903/
>
> Ben
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman