Hi Paul,

--- Paul Leyland <[EMAIL PROTECTED]> wrote:
> > From: Olivier Langlois [mailto:[EMAIL PROTECTED]]
...
> 
> Actually, we at Microsoft Research in Cambridge have seen similar effects
> when compiling and running FFTW code.  Our discovery is that the alignment
> of FP data values is critical.  Get it wrong, and performance can plummet.
> Unless you set the alignment explicitly, it will be wrong approximately half
> the time.
>
> Jonathan Hardwick investigated this effect as part of his research into
> high-performance computing.  He gave an internal seminar (which is where I
> learned about it) and wrote it up in detail.  The full details are at
> http://www.research.microsoft.com/users/jch/fftw-performance.html

I agree with you that data alignment is an important
issue for performance. A misaligned value that straddles
an alignment boundary can need 2 memory reads to be
fetched. I was aware that MSVC++ 6 is better than some
other compilers in this respect because it ANDs the
stack pointer with 0xFFFFFFF8 to make sure that local
variables are aligned on an 8-byte boundary.
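
To make the masking concrete, here is a minimal C sketch of
the same arithmetic applied to ordinary addresses. The
helper names (align_down_8, is_aligned_8, align_up_8) are
made up for illustration; this is not what MSVC emits, only
the equivalent bit manipulation written out by hand.

```c
#include <stdint.h>
#include <stddef.h>

/* Round an address down to the previous multiple of 8. On a 32-bit
   target this is the same as ANDing with 0xFFFFFFF8. */
static uintptr_t align_down_8(uintptr_t addr)
{
    return addr & ~(uintptr_t)7;
}

/* Nonzero if p sits on an 8-byte boundary. */
static int is_aligned_8(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}

/* First 8-byte-aligned address at or after p. The caller must have
   over-allocated the buffer by at least 7 bytes. */
static void *align_up_8(void *p)
{
    return (void *)(((uintptr_t)p + 7u) & ~(uintptr_t)7);
}
```

Rounding down is what works for a stack that grows toward
lower addresses; for a heap buffer you would over-allocate
by 7 bytes and round up instead.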

However, MSVC's FPU register allocation and assignment
algorithm could be greatly improved (that was my point
in my previous e-mail, and the assembler listings
generated by MSVC from FFTW are a good way to see what I
mean). MSVC almost never uses all 8 FPU registers.
It seems like your optimizer is good on a per-statement
basis, but as soon as you have a sequence of short FP
instructions, it does a very bad job.

With these unused registers:

- it could start a new FP operation instead of storing
the result of an unfinished FP operation back to
memory, which wastes a few CPU cycles.

- it could use them to hold temporary variables and
reduce memory accesses.

With a better FPU register allocation and assignment
algorithm than the current one, I wouldn't be surprised
to see FFTW run 20-30% faster than it does now with MSVC.
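
As an illustration of what "a sequence of short FP
instructions" looks like, here is a hypothetical radix-2
butterfly in C. It is not taken from FFTW's generated code;
the type cplx and the function butterfly are invented for
this sketch. The point is only that the handful of live
values (tr, ti, ar, ai and the twiddle factor) is small
enough that a register allocator could plausibly keep them
all in FPU registers rather than spilling to memory between
statements.

```c
/* Hypothetical complex type, for illustration only. */
typedef struct { double re, im; } cplx;

/* Radix-2 butterfly with twiddle factor w = wr + i*wi:
   a' = a + w*b,  b' = a - w*b. */
static void butterfly(cplx *a, cplx *b, double wr, double wi)
{
    /* t = w * b: four multiplies, one subtract, one add. */
    double tr = wr * b->re - wi * b->im;
    double ti = wr * b->im + wi * b->re;

    /* Read a once so both outputs reuse the same temporaries. */
    double ar = a->re;
    double ai = a->im;

    b->re = ar - tr;
    b->im = ai - ti;
    a->re = ar + tr;
    a->im = ai + ti;
}
```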

While we're at it, there is a simple optimization that
MSVC doesn't perform.

When there is an expression like this:

x % n

where x is unsigned and n is a numeric constant of the
form 2^y with y < 32 (so that n fits in 32 bits), the
compiler could replace it with

x & (n-1) (An optimization using a Mersenne Number :-)

instead of doing a costly division as it does right now.
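
A minimal C sketch of the replacement, assuming 32-bit
unsigned operands. The names mod_pow2 and is_pow2 are
made up for illustration. The identity holds only for
unsigned (or non-negative) x: with signed operands under
C99 rules, -3 % 8 is -3, while -3 & 7 is 5 on a
two's-complement machine, so a compiler would need extra
fix-up code in the signed case.

```c
#include <stdint.h>

/* Nonzero if n is a power of two (n = 2^y for some y in 0..31). */
static int is_pow2(uint32_t n)
{
    return n != 0 && (n & (n - 1u)) == 0;
}

/* x % n for unsigned x and n a power of two, without a division.
   n - 1 is the all-ones mask 2^y - 1, i.e. the low y bits. */
static uint32_t mod_pow2(uint32_t x, uint32_t n)
{
    return x & (n - 1u);
}
```

For example, mod_pow2(29, 8) keeps the low three bits of
29 (binary 11101) and gives 5, the same as 29 % 8.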

Greetings
Olivier Langlois
http://www3.sympatico.ca/olanglois -
[EMAIL PROTECTED]
Nortel Networks 514-818-1010 x46659
Montreal, Canada



