On Monday 23 August 2010 15:39:00 Jonathan Morton wrote:
> > > > I suspect using floats would be *much* better than the existing
> > > > fixeds on modern x86_64 systems. But fixed will remain important on
> > > > smaller, lighter systems for some time to come.
> > >
> > > I believe so too, and I have some actual numbers to back it up.
> >
> > You forgot to attach the numbers. :)
>
> Well, here is a very brief but representative example:
>
> (float) src_8888_8_0565 = L1: 111.99 L2: 113.84 M:105.89 ( 20.72%) HT: 64.94 VT: 78.99 R: 55.08 RT: 23.74 ( 329Kops/s)
> (fixed) src_8888_8_0565 = L1: 62.29 L2: 63.66 M: 63.02 ( 12.64%) HT: 59.05 VT: 57.64 R: 52.52 RT: 32.20 ( 446Kops/s)
>
> Most of the numbers are Mpix/s, the %-age numbers in the middle are of
> estimated available memory bandwidth. The floating-point path has a
> large (50%+) advantage in throughput, while the fixed-point path seems
> to have less setup overhead which shows up on tiny (8x8) operations.

What kind of hardware did you test, by the way? And how did you
calculate the memory bandwidth percentage? It may be a bit tricky
because this operation is asymmetric: it reads 5 bytes per pixel while
writing only 2.

But in any case, it looks like you are setting the bar way too low and
comparing very bad performance with an even worse one here :) I don't
see any way for this operation (btw, why did you select this one?) to be
faster with a floating point implementation on ARM Cortex-A8, for
example. With ARM NEON, a vectorized fixed point implementation looks
like this:
http://lists.freedesktop.org/archives/pixman/2010-August/000414.html

The NEON implementation spends ~4 cycles per pixel with the pixel data
in L1 cache, even for this simple non-pipelined code. The performance
can typically be improved by something like 30% with better instruction
scheduling and pipelining, but that does not make much sense because
memory bandwidth is limiting performance anyway, and it can't go up
unless working with the data in L1 or L2 cache. I hope that ARM
Cortex-A9 based systems will have much faster memory so that NEON can
really shine.

Also, if you have a look at these NEON patches, it becomes clear that it
is not difficult to implement practically any nontransformed compositing
operation by just connecting some simple chunks of assembly code
together (over_8888_8_0565 fully reuses the code from over_n_8_0565,
and src_8888_8_0565 is just the same as over_8888_8_0565 with a block of
instructions removed from the middle). A lot of nontransformed ARM NEON
fast paths are quite easy to either implement manually or generate
automatically (again, either by producing assembly source code or by
doing dynamic code generation at runtime).

Something similar can also be tried for x86, targeting Intel Atom for
example, because it has a simple predictable pipeline and also needs
performance the most.
It does not need manual prefetch, but likes aligned memory accesses for
both reading and writing data, as implemented in the recent Intel SSE3
patch which is under review at the moment.

The whole point is that it should be possible to have really fast code
for such simple fast paths, and taking target specific features and
properties into account helps further. When the performance is far from
memory bandwidth limits, it is likely that there is still a lot of room
for improvement.

Regarding fixed point vs. floating point in general: as an example, we
can have a look at multimedia codecs. Floating point calculations are
preferred for audio codecs nowadays, but video codecs are almost all
integer only. The difference is that video typically works with 8-bit
samples, but audio works with at least 16-bit samples. Fixed point is
usually faster for low precision. Floating point is usually faster for
high precision.

Based on the instruction cycle timings for armv6 processors and newer,
anything that requires 16-bit (or 8-bit) integer multiplications is
generally faster with fixed point. But 32-bit integer multiplications
are better replaced with single precision floating point calculations
if possible (and if a VFP/NEON unit is available). This is the crossover
point.

But surely not everything is so simple: floating point operations
provide better throughput, but have higher latency. Also, floating point
operations are slow when used for comparison and branching. Integer
additions are really fast. On the other hand, fixed point
multiplications require extra shift instructions. There is no clear
winner for all the possible cases.

Anyway, I expect floating point to perform reasonably well for the
matrix stuff and coordinates in pixman (if the target CPU has a hardware
floating point unit). But IMHO it is too early to drop the use of the
fixed point implementation for pixel processing.

> And that's not exactly the most complex operation on the table.
> In fixed-point, it's a multiply by the unified mask followed by a
> 3-channel format conversion. Much more trivial than that and you get
> memcpy().
>
> This is all achieved by using lookup tables to accelerate the
> fixed-to-float conversions (tables are pre-generated up to 16bpc),
> leaving only the store operations to be run through a real
> float-to-fixed converter.

Table lookups are slow because they may generate a lot of L1 cache
misses (especially with lookups using 16-bit values as indexes). But it
depends on the pixel data. Solid filled images are going to be faster
than the ones filled with random data. Also, table lookups make SIMD
optimizations quite challenging.

-- 
Best regards,
Siarhei Siamashka
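
P.S. To make the fixed point side of this a bit more concrete, here is a
minimal scalar sketch in C of what src_8888_8_0565 does per pixel as
described in the quoted text: multiply each color channel by the 8-bit
mask, then convert to r5g6b5. The helper names (mul_un8,
src_8888_8_0565_pixel) are made up for illustration and the a8r8g8b8
channel layout is an assumption. This is neither the actual pixman code
nor the NEON fast path, just plain C showing the extra shifts that a
fixed point multiplication needs:

```c
#include <stdint.h>

/* Hypothetical helper: multiply two 8-bit values treated as fractions
 * of 255, with rounding. The two shifts stand in for a division by 255. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Hypothetical helper: one pixel of src_8888_8_0565.
 * Assumes src is a8r8g8b8 in a 32-bit word and mask is 8-bit alpha. */
static uint16_t src_8888_8_0565_pixel(uint32_t src, uint8_t mask)
{
    /* Multiply each color channel by the mask. */
    uint8_t r = mul_un8((uint8_t)(src >> 16), mask);
    uint8_t g = mul_un8((uint8_t)(src >> 8), mask);
    uint8_t b = mul_un8((uint8_t)src, mask);

    /* 3-channel format conversion: keep the top 5/6/5 bits. */
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

For example, a fully opaque white source with a 0xff mask gives 0xffff,
and any source with a zero mask gives 0. A vectorized version would do
the same arithmetic on several pixels at once.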
_______________________________________________
Pixman mailing list
http://lists.freedesktop.org/mailman/listinfo/pixman
