Siarhei Siamashka <siarhei.siamas...@gmail.com> writes:

> By the way, fallbacks may need to be tweaked a bit to be sure that
> really the fastest code is selected. For example, after the introduction
> of faster fetchers [1], now the performance of 'over_8888_8888'
> operation (translucent case) with the nearest scaling of the source
> image from 2000x2000 to approximately same size looks like this:
>
> C fetcher + C combiner: ~76.71 MPix/s
> C fast path: ~106.47 MPix/s
> C fetcher + SSE2 combiner: ~182.65 MPix/s
> SSE2 fast path: ~270.57 MPix/s
>
> The important part is that "C fetcher + SSE2 combiner" is now faster
> than full "C fast path", which performs everything in a single pass, but
> without SSE2. So when SSE2 combiners are available, it makes sense to
> deactivate nearest scaling C fast paths for OVER operator. Not everything
> is so simple though, because this C fast path has special processing for
> fully transparent or opaque pixels, and may in some cases be actually
> better. But it is clearly slower if the performance of the worst case is
> important.
>
> Something similar may apply to PPC Altivec optimizations. Because there
> are only Altivec combiners but not full Altivec fast paths, the existing
> C fast paths will be executed for some compositing operations, preventing
> the use of Altivec combiners.

Indeed, this is increasingly a problem. Another variation is if there
is a SIMD fast path that can deal with affine transformations, and a
general fast path specialized for scaling transformations. Which one
do you pick for a scaling operation? It isn't obvious.

Or, suppose there is an MMX fast path for over_8888_8888, but also a
"noop" source iterator for a8r8g8b8. Then the general path would
degenerate to just calling sse2_combine_over_u() in a loop with no
fetching, which is likely faster than mmx_composite_over_8888_8888().

I don't have a good answer, but here are some random ideas:

- Classify compositing routines in terms of performance. I.e., try to
  statically associate each compositing routine and fetcher with a
  number, then pick the one that should be fastest.

- If the fast path cache is empty for the operation in question, try
  all possibilities and store the best one in the cache. This will
  probably be unpleasant and complicated.

- Tweak the fallbacks so that things work in practice. This is ugly
  and fragile.

- JIT compiler. This is a good solution, but a lot of work.

It isn't a huge problem yet, but it will only get worse.

Soren
_______________________________________________
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
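
For illustration, the first idea in the list above (a static performance
number per routine, then pick the highest) could be sketched roughly as
follows. This is only a sketch under stated assumptions, not pixman code:
candidate_t, pick_fastest(), the FEATURE_* bits and the table name are all
hypothetical, and the throughput figures are simply the ones from the quoted
over_8888_8888 nearest-scaling measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU feature bits; these are not pixman's real flags. */
enum { FEATURE_NONE = 0, FEATURE_SSE2 = 1 << 0 };

/* One way of carrying out an operation, with a static speed estimate. */
typedef struct
{
    const char *name;       /* label, for illustration only */
    unsigned    required;   /* feature bits this path needs */
    double      mpix_per_s; /* estimated throughput; higher is faster */
} candidate_t;

/* Figures taken from the quoted measurements (assumed representative). */
static const candidate_t over_8888_8888_nearest[] =
{
    { "C fetcher + C combiner",    FEATURE_NONE,  76.71 },
    { "C fast path",               FEATURE_NONE, 106.47 },
    { "C fetcher + SSE2 combiner", FEATURE_SSE2, 182.65 },
    { "SSE2 fast path",            FEATURE_SSE2, 270.57 },
};

/* Return the usable candidate with the highest estimate, or NULL if none. */
static const candidate_t *
pick_fastest (const candidate_t *table, size_t count, unsigned available)
{
    const candidate_t *best = NULL;
    size_t i;

    assert (table != NULL || count == 0);

    for (i = 0; i < count; i++)
    {
        const candidate_t *c = &table[i];

        /* Skip paths that need a CPU feature we do not have. */
        if ((c->required & available) != c->required)
            continue;

        if (best == NULL || c->mpix_per_s > best->mpix_per_s)
            best = c;
    }

    return best;
}
```

If only the first three entries exist and SSE2 is available, this selects
"C fetcher + SSE2 combiner" over "C fast path", which is the case described
in the quoted message. A single number per routine cannot capture the special
handling of fully transparent or opaque pixels, so the estimate reflects the
worst case only.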