> What kind of hardware did you test by the way? And how did you calculate
> memory bandwidth percentage (it may be a bit tricky because this operation
> is kind of asymmetric and reads 5 bytes per pixel, while only writing 2)?

Those particular numbers are from a 3GHz C2D. Unfortunately x86 is still
the platform with the best compiler support, so the numbers I can presently
get for ARM and PPC are not as favourable (bearing in mind that the GCC I
have for PPC32 is rather ancient - I will try to upgrade it).

Even on x86 the compiler is clearly overwhelmed, judging by some of the
output code, and the fact that performance paradoxically goes down when
given a solid source and/or mask (seriously...). So much for "inline is as
fast as a macro", and "the compiler makes faster code than a human can".

Memory bandwidth is estimated by running memcpy() and counting total bytes
loaded *and* stored per second. It's just a guide to estimate efficiency,
and does sometimes exceed 100% with SIMD paths. That's fine.

> But in any case, looks like you are setting the bar way too low and
> comparing very bad performance with even worse one here :)

And the comparison of "bad" with "worse" is valid, in that it is very easy
to accidentally trigger the "worse" cases with application code - anything
that is not yet implemented using a CPU-specific SIMD path, and certainly
anything that does not even have a pixman-fast-paths.c entry.

I fully expect that SIMD paths written using full knowledge of the CPU's
capabilities will always outperform generic code - my aim is merely to
improve the generic code to reduce the need for CPU-specific code.

> I don't see any way for this operation (btw, why did you select this
> one?) to be faster with a floating point implementation on ARM Cortex-A8
> for example.

Comparing scalar fixed-point with the VFP unit on A8 will definitely favour
fixed-point, because the VFP instructions are non-pipelined for no apparent
reason (an oversight reportedly fixed in A9 and probably A5 too).
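
As an aside on the bandwidth figure mentioned earlier, here is a minimal
sketch of how such an estimate could be made. It is not the benchmark
code that produced the numbers above. The names memcpy_bandwidth() and
bandwidth_percent() are hypothetical, clock() is used only because it is
standard C, and counting 5 + 2 = 7 bytes of traffic per pixel for the
asymmetric operation is an assumption about how the percentage is formed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: estimate memory bandwidth in bytes/second by timing
 * memcpy(). Every copied byte is counted twice: once loaded, once stored.
 * Returns 0.0 if allocation fails or the run is too short to measure. */
static double memcpy_bandwidth(size_t size, int repeats)
{
    assert(size > 0 && repeats > 0);

    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (!src || !dst) {
        free(src);
        free(dst);
        return 0.0;
    }
    memset(src, 1, size);   /* touch both buffers so pages are mapped */
    memset(dst, 0, size);   /* before the timed loop starts */

    volatile unsigned char sink = 0;
    clock_t start = clock();
    for (int i = 0; i < repeats; i++) {
        src[0] = (unsigned char)i;   /* vary the input between copies */
        memcpy(dst, src, size);
        sink ^= dst[size - 1];       /* read the result so the copy is used */
    }
    clock_t end = clock();
    (void)sink;

    free(src);
    free(dst);

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    return 2.0 * (double)size * (double)repeats / seconds;
}

/* Hypothetical helper: express an operation's throughput as a percentage
 * of the memcpy() figure. Assumes all bytes read and written per pixel
 * count equally as memory traffic. */
static double bandwidth_percent(double pixels_per_sec,
                                double bytes_read_per_pixel,
                                double bytes_written_per_pixel,
                                double memcpy_bytes_per_sec)
{
    assert(memcpy_bytes_per_sec > 0.0);
    double traffic = pixels_per_sec *
                     (bytes_read_per_pixel + bytes_written_per_pixel);
    return 100.0 * traffic / memcpy_bytes_per_sec;
}
```

With buffers much larger than the last-level cache, something like
bandwidth_percent(rate, 5.0, 2.0, memcpy_bandwidth(64u << 20, 16)) would
give a rough efficiency figure for an operation that reads 5 bytes and
writes 2 per pixel. Because memcpy() itself may not saturate the memory
bus, a result above 100% is possible and not necessarily an error.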

Comparing NEON fixed-point with NEON floating-point will *usually* favour
fixed-point for simpler operations, because it can simply do more of them
per cycle and the setup overhead for floating-point is high. But if you use
NEON instructions in a scalar manner, as early versions of LLVM can, I
think the balance is tipped in favour of floating-point just as much as for
full desktop CPUs.

As for why I selected that operation specifically, it was a relatively
simple one which appeared in my benchmark run and illustrated my point. It
happens to be simple enough that the compiler didn't choke too much on all
the inlined dead code.

Switching viewpoint from complete fastpaths to combiners, the more complex
combiners (especially the PDF series that probably get used by Cairo - any
numbers on that?) are definitely faster in floating-point than in
fixed-point, even when the penalty of additional memory bandwidth is taken
into account. Part of this is due to a rewrite which improves algorithmic
efficiency in a few cases, and uses single precision instead of double
internally for some of the nastier PDF cases.

That result holds up to a limited extent even on Cortex-A8 using the VFP,
which is pretty much the worst case available for this conversion.

- Jonathan Morton
