Simon Burge writes (Hi, Simon!):
>It looks like the DEC C compiler has the same problem - adding "-assume
>accuracy_sensitive" to the compiler command line fixed the problem where
>MacLucasUnix thought it could do 33219281 with a 1M FFT
Yes, MacLucasUnix uses the same fast-round strategy, so it would suffer the
same fate under aggressive optimization unless care is taken.
>Have you got 2.6x and 2.6a mixed up? From what I understand, David used
>2.6x for the double check, and 2.6a is currently on your ftp site.
I hope not - otherwise, what was I working on for most of the past week? :)
Actually, you're right - I created 2.6b and executables on another server
(the one which has my F90 license), and in my haste to transfer the final
files to my webserver and finish writing my posting I neglected to transfer
the 2.6b source to the webserver (the executables were there). It's there
now. The 2.6a source has been moved to the /pub/archived directory, for those
of you who want to see what changes I made to the source to squeeze more
speed out of the code.
>These results are for a copy of Mlucas_2.6a.f90 I compiled with:
>
> f90 -o lm -tune ev6 -O5 lucas_mayer_V2.5b.f90
Many thanks for the timings! Looks like Mlucas is no more than 20% slower
than MacLucasUnix at any runlength, and catches up with it at 1024K. Now
get ready for a 15-20% further speed boost...
The good news is, v2.6b should be about 10% faster (esp. at large FFT
lengths) than your timings of 2.6a.
Note that for 2.6b I've precompiled 3 separate executables for the three
main Alpha generations (ev4 = 21064, ev5 = 21164, ev6 = 21264). If you
have time, try compiling the 2.6b source locally with various options
and see if you can beat the ones I used for the ev6 executable (but
please see my note about the timings below!)
>The Mlucas_2.6a.exe from your FTP site gives the same results but is
>slightly slower (around 5%) on the DS20:
That 5% is likely due to the fact that the 2.6a .exe was compiled
for a generic platform, i.e. sans ev4/5/6-specific tunings.
> % time ./Mlucas_2.6a.exe < foo
Note that on platforms which support the real*16 data type, you can't
just do a single timing run and get an accurate timing. That's because
if real*16 (actually, any extended-precision floating data type with
at least 18 decimal digits of precision - I coded it that way to access
the x86 80-bit register type, with compilers that can make use of it)
is available, the program auto-detects it and uses it for high-precision
FFT sincos data initializations. This may take up to the equivalent of
40-80 iteration times. To subtract out this initialization component, do
both a 100-iteration and a 200-iteration timing run, take the difference
of the two totals and divide by 100. (The README file describes this.)
I look forward to the revised (and hopefully improved) ev6 timings!
Brian Beesley writes (Hi, Brian!):
>> Hmm, those upper limits seem a bit low. Does MacLucasUnix tell you when
>> the exponent is too large?
>
>The way I understand MacLucasUNIX _should_ work is that it guesses
>that an FFT size of the next power of 2 above exponent/32 might work.
>This is usually hopelessly optimistic. However, it monitors the round-
>off errors for the first "few" (hundreds) of iterations, if the round-
>off errors are too large then it restarts with double the FFT size.
See Simon's note about the accuracy_sensitive option being needed for
this to work properly.
>> size: 128K 256K 320K 384K 448K 512K 640K 768K 896K 1024K
>> pmax: 2.62M 5.20M 6.46M 7.71M 8.96M 10.2M 12.6M 15.1M 17.5M 20M
>>
>> On strictly real*8-type hardware, the upper limits are about 1% less.
>> The only cost of real*16 inits is a greater initialization time.
>
>Does this not make assumptions about the accuracy of the trig
>functions in the library, also does it take into account the effect
>of any "magic numbers" used in the DWT to achieve remaindering modulo
>2^p-1 ?
Sure it makes some assumptions, but not unreasonable ones. I know this
because the SPEC98-submitted version of Mlucas has been tried on a huge
variety of platforms and compiled with many different compilers, and the
"1% rule" has proved a good one. Even with the most aggressive compile
options and specifying less-accurate trig functions that are faster (a
silly thing to do with Mlucas, since one only computes them once, so why
not spend a few seconds more and get better accuracy?), one could still
always get a pmax of at least 98% of the values listed above. The DWT weight
factors (your "magic numbers") obey similar rules, i.e. there's nothing
special about them (at least to the F90 compiler - perhaps the C compiler
uses a different library? That would be dumb, but not shocking) that
should cause them to be much more inaccurate than the trig functions,
assuming the intrinsic function library was well-written (that may not
be a good assumption on all platforms, unfortunately.)
>> Now, if we could squeeze gains out by assembly coding anywhere near those
>> Jason Papadopoulos obtained on the SPARC via the ASM route, that would
>> be impressive, indeed.
>
>I don't know what the F90 compilers do - but gcc can produce ASM from
>C source files - if I had loads of time, I'd start from that & a
>profile listing of where the program spends most of its time, & see
>how much I could improve those bits of the code by massaging the ASM
>output by the compiler.
Pretty much what I had in mind for starters, except I'd rather spend my
time working on basic algorithmic improvements and code functionality
(e.g. adding p-1 factoring.) Any ASM experts out there with some free
time?
Cheers,
Ernst
_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers