> The ideal situation is if the C code is written in such a way that
> modern optimizing compilers do the right thing by default and produce
> good code for any CPU. This should mean that the compilers
> automatically produce SSE code where they should if it is enabled.
Yes, a good thought. Unfortunately the compilers are not NEARLY there yet with regard to using SSE instructions in the best way possible, and I'm not sure they're really going to get there... In C/C++, where the natural value is a single int, it would be difficult, without very specific design considerations in the source code, to take good advantage of what SSE primarily offers: carrying multiple data items per register and parallelizing calculations on that data.

The challenge is to design an application to take advantage of being able to do multiple calculations at once, and to carry related chunks of data in a big (e.g., __m128) register. Pixel manipulations CAN map well onto this sort of thing... My Photoshop plug-ins, for example, now use SSE2 throughout, and everything is stored chunky and in floating point (e.g., we put one RGBA pixel in an __m128). The floating point gives advantages in terms of overflow/underflow/loss-of-precision protection, and the parallel processing offsets the increased memory bandwidth used by the longer values. But of course the source code is less maintainable and less portable because of the SSE usage - it's less like C and more like embedded assembly (we use the Intel intrinsics such as _mm_mul_ps). We went into this with our eyes open, and I'm glad we made the decisions we did.

The interesting thing is that while we were refactoring our plug-ins, the use of SSE really didn't pay off in performance until we had the code embracing the concepts throughout. Everything got faster all at once near the end of the project. I can say from that experience that the one thing you absolutely DON'T want is HALF an SSE implementation... Getting things into and out of XMM registers (i.e., during conversions) is inefficient. The application has to "think in parallel" overall, and use a format (e.g., floating point) throughout that matches the SSE capabilities, in order to work well.
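To make the "one RGBA pixel per __m128" idea concrete, here's a minimal sketch (my own illustration, not code from the plug-ins or from Little CMS) of scaling one chunky floating-point pixel with a single _mm_mul_ps - four channel multiplies in one instruction. The function name scale_pixel is hypothetical:

```c
/* Hypothetical sketch: one RGBA pixel held "chunky" in a single __m128,
 * scaled by a per-channel gain with one _mm_mul_ps instead of four
 * scalar multiplies. Not taken from any real plug-in or library code. */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Multiply one RGBA pixel (4 floats) by a 4-channel gain in parallel. */
static void scale_pixel(const float rgba[4], const float gain[4], float out[4])
{
    __m128 px = _mm_loadu_ps(rgba);        /* load R,G,B,A into one XMM register */
    __m128 g  = _mm_loadu_ps(gain);        /* load the four gains alongside it   */
    _mm_storeu_ps(out, _mm_mul_ps(px, g)); /* four multiplies in one instruction */
}
```

The payoff only shows up when the pixel stays in this format across the whole pipeline; load/store round-trips like the ones above at every step are exactly the "half an SSE implementation" conversion cost mentioned below.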
By the way, as an exercise to reinforce the above, I re-coded the Little CMS floating point trilinear interpolation algorithm using SSE2 intrinsics. It ended up delivering the same performance as the C-coded version. Why not better? Because the table-based design of the Little CMS library doesn't suit parallel calculations, so there were only limited things I could do. Let me be clear: I'm not suggesting redesigning Little CMS soup to nuts, just throwing out a few thoughts and ideas. :)

Regarding Marti's comment:

> I think optimizations have to be done by arranging C code to help
> compiler

I agree, and by doing so and testing the results I'm finding there is still some additional performance to be had from rearranging the code - for example, not requiring the compiler to keep intermediate data live across a long sequence of instructions (i.e., reducing register starvation).

By the way, the performance appeared to have dropped a fair bit between release 2.8 and what I downloaded from Git just the other day. I think I've got it all back, and a little more, at this point.

-Noel