Hi Paul,
--- Paul Leyland <[EMAIL PROTECTED]> wrote:
> > From: Olivier Langlois [mailto:[EMAIL PROTECTED]]
...
>
> Actually, we at Microsoft Research in Cambridge have seen similar effects
> when compiling and running FFTW code. Our discovery is that the alignment
> of FP data values is critical. Get it wrong, and performance can plummet.
> Unless you set the alignment explicitly, it will be wrong approximately half
> the time.
>
> Jonathan Hardwick investigated this effect as part of his research into
> high-performance computing. He gave an internal seminar (which is where I
> learned about it) and wrote it up in detail. The full details are at
>
> http://www.research.microsoft.com/users/jch/fftw-performance.html
I agree with you that data alignment is an important
issue for performance. A misaligned value that straddles
an alignment boundary can need two memory reads to be
fetched. I was aware that MSVC++ 6 is superior to some
other compilers in this respect, because it ANDs the
stack pointer with 0xFFFFFFF8 to make sure that local
variables are aligned on an 8-byte boundary.
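
To make the masking trick concrete, here is a minimal C sketch of what
ANDing with 0xFFFFFFF8 does to an address. The helper names
(align_down_8, align_up_8, is_aligned_8) are my own, for illustration
only, and are not anything MSVC exposes. Rounding down suits a stack
that grows toward lower addresses; for a heap buffer one would usually
round up instead.

```c
#include <assert.h>
#include <stdint.h>

/* Round an address down to the previous 8-byte boundary by clearing
   its low three bits. On a 32-bit target the mask ~7 is 0xFFFFFFF8. */
static uintptr_t align_down_8(uintptr_t addr)
{
    return addr & ~(uintptr_t)7;
}

/* Round an address up to the next 8-byte boundary (or leave it
   unchanged if it is already aligned). */
static uintptr_t align_up_8(uintptr_t addr)
{
    return (addr + 7) & ~(uintptr_t)7;
}

/* Nonzero if addr is a multiple of 8, i.e. a double stored there
   does not straddle an 8-byte boundary. */
static int is_aligned_8(uintptr_t addr)
{
    return (addr & (uintptr_t)7) == 0;
}
```

For example, align_down_8(0x1005) gives 0x1000 and align_up_8(0x1001)
gives 0x1008.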
However, MSVC's FPU register allocation and assignment
algorithm could be greatly improved (that was my point
in my previous e-mail, and the assembler listings that
MSVC generates from FFTW show what I mean). MSVC
almost never uses all 8 FPU registers.
It seems that your optimizer is good on a per-statement
basis, but as soon as it meets a sequence of short FP
instructions, it does a very bad job.
With these unused registers:
- it could start a new FP operation instead of storing
the result of an unfinished FP operation back to
memory, which wastes a few CPU cycles.
- it could use them to hold temporary variables and
reduce memory accesses.
With a better FPU register allocation and assignment
algorithm than the current one, I wouldn't be surprised
to see FFTW run 20-30% faster than it does now with MSVC.
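
To illustrate the kind of code I mean by "a sequence of short FP
instructions", here is a hypothetical radix-2 butterfly in C. It is my
own sketch, not taken from FFTW, and the cplx type and butterfly name
are made up for the example. Every temporary is short-lived, so a good
register allocator has the chance to keep them in FPU registers rather
than spilling them to memory between statements. Whether a given
compiler actually does so can only be checked in its assembler listing.

```c
#include <assert.h>

/* Hypothetical complex type for this example; not FFTW's own. */
typedef struct { double re, im; } cplx;

/* Radix-2 butterfly: (a, b) -> (a + w*b, a - w*b).
   Each input is loaded once into a local, and the product w*b is
   computed once and reused for both outputs. */
static void butterfly(cplx *a, cplx *b, cplx w)
{
    double br = b->re, bi = b->im;
    double tr = w.re * br - w.im * bi;   /* real part of w*b */
    double ti = w.re * bi + w.im * br;   /* imaginary part of w*b */
    double ar = a->re, ai = a->im;

    a->re = ar + tr;
    a->im = ai + ti;
    b->re = ar - tr;
    b->im = ai - ti;
}
```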
While we're at it, there is a simple optimization that
MSVC doesn't perform.
When there is an expression like this:
x % n
where x is unsigned and n is a numeric constant of the
form 2^y with y < 32, the compiler could replace it with
x & (n-1) (an optimization using a Mersenne number :-)
instead of doing a costly division, as it does right now.
(For a signed x the two can differ when x is negative, so
the mask alone is not enough there.)
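
Here is a small C sketch of that replacement, written by hand the way
the compiler could do it. The function names are mine. Note that the
identity only holds as-is for unsigned x: with C99 semantics the result
of % takes the sign of the dividend, so for a negative signed x the
mask gives a different answer.

```c
#include <assert.h>

/* Nonzero if n is a power of two (n = 2^y for some y >= 0). */
static int is_pow2(unsigned n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* x % n for unsigned x and power-of-two n, computed with a mask
   instead of a division. n - 1 is 2^y - 1, i.e. y low bits set. */
static unsigned mod_pow2(unsigned x, unsigned n)
{
    assert(is_pow2(n));
    return x & (n - 1);
}
```

For example, mod_pow2(13, 8) gives 5, the same as 13 % 8. By contrast,
with signed int operands -3 % 8 is -3 under C99, while -3 & 7 is 5 on a
two's-complement machine.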
Greetings
Olivier Langlois
http://www3.sympatico.ca/olanglois -
[EMAIL PROTECTED]
Nortel Networks 514-818-1010 x46659
Montreal, Canada
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers