> The ideal situation is if the C code is written in such a way that 
> modern optimizing compilers do the right
> thing by default and produce good code for any CPU.  This should mean 
> that the compilers automatically
> produce SSE code where they should if it is enabled.

Yes, a good thought.

Unfortunately the compilers are not NEARLY there yet with regard to using 
SSE instructions in the best way possible.  And I'm not sure they're really 
going to get there...  In C/C++, where the natural value is a single int, 
it's difficult to take good advantage of what SSE primarily has to offer 
(carrying multiple data items per register and parallelizing calculations 
on that data) without very specific design considerations in the source 
code.

The challenge is to design an application to take advantage of being able 
to do multiple calculations at once, and to carry related chunks of data 
in a big (e.g., __m128) register.

Pixel manipulations CAN map well into this sort of thing...  My Photoshop 
plug-ins, for example, now use SSE2 throughout and everything's stored 
chunky and in floating point (e.g., we put one RGBA pixel in an __m128). 
The floating point gives advantages in terms of 
overflow/underflow/loss-of-precision protection, and the parallel 
processing offsets the disadvantage of the increased memory bandwidth 
needed for the wider values.

But of course the source code is less maintainable and less portable 
because of the SSE usage - it's less like C and more like embedded 
assembly (we use the Intel intrinsics such as _mm_mul_ps).  We went into 
this with our eyes open and I'm glad we made the decisions we did.

The interesting thing is that while we were refactoring our plug-ins, the 
use of SSE really didn't pay off in performance until we had the code 
embracing the concepts throughout.  Everything got faster all at once near 
the end of the project.

I can say from that experience that the one thing you absolutely DON'T 
want is HALF an SSE implementation...  Getting things into and out of 
XMM registers (i.e., during conversions) is inefficient.  The 
application has to "think overall" in parallel and use a format (e.g., 
floating point) throughout that matches the SSE capabilities to work well.

By the way, as an exercise to reinforce the above, I re-coded the 
LittleCMS floating point trilinear interpolation algorithm using SSE2 
intrinsics.  It ended up delivering the same performance as the C-coded 
version.  Why not better?  Because the table-based design of the Little 
CMS library doesn't suit parallel calculations so there were only limited 
things I could do.

Let me be clear, I'm not suggesting redesigning Little CMS soup to nuts. 
Just throwing out a few thoughts and ideas.  :)

Regarding Marti's comment:

> I think optimizations have to be done by arranging C code to help 
> compiler

I agree, and by doing so and testing the results I'm finding that there 
is still some additional performance to be had from rearranging the code, 
e.g., so that the compiler isn't required to keep intermediate data live 
across a long sequence of instructions (which reduces register 
starvation).

By the way, the performance appeared to have dropped a fair bit between 
release 2.8 and what I downloaded from Git just the other day.  I think 
I've got it all back and a little more at this point.

-Noel



_______________________________________________
Lcms-user mailing list
Lcms-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lcms-user