Bill Rea <[EMAIL PROTECTED]> wrote (re. Mlucas and
MacLucasUNIX timings on a 400MHz SPARC E450 with 4MB L2 cache):
>>At 256K FFT I was seeing 0.11 secs/iter with
>>Mlucas against 0.15 secs/iter for MLU, at 512K FFT the figures were
>>0.25 secs/iter and 0.29 secs/iter respectively.
...and later added:
>Observing the running process with top, it looks to me as if
>the 256K Mlucas fits completely in the L2 cache, whereas MLU
>takes about 12Mb. At 512K FFT around 60% of the running process
>will fit in the L2 cache. It's a cache size issue here.
A good rule of thumb is that Mlucas needs about (FFT length) x 8 bytes
of memory for the data, plus another 10% or so on top of that, plus
some more for the program code itself. Thus a 256K FFT should need a
bit more than 2MB for data, and perhaps a few hundred KB more for
code, i.e. the whole thing should fit into a 4MB L2 cache with room
to spare. At 512K FFT length, data+code probably need 5-6MB total.
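To make the arithmetic concrete, here's a tiny C sketch of that rule
of thumb; the 10% overhead factor is the one quoted above, and the
512KB code-size figure is just my guess, not a measured number:

    #include <stdio.h>

    /* Rough Mlucas memory-footprint estimate: 8 bytes per FFT word,
       plus ~10% for auxiliary data, plus a guessed-at code size. */
    static double footprint_mb(long fft_length)
    {
        double data = fft_length * 8.0 * 1.10;  /* data + ~10% overhead   */
        double code = 512.0 * 1024.0;           /* assumed ~512KB of code */
        return (data + code) / (1024.0 * 1024.0);
    }

    int main(void)
    {
        printf("256K FFT: ~%.1f MB\n", footprint_mb(256L * 1024L)); /* ~2.7 MB */
        printf("512K FFT: ~%.1f MB\n", footprint_mb(512L * 1024L)); /* ~4.9 MB */
        return 0;
    }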
>On my Ultra-5 with a small 256Kb L2 cache I get 0.58 secs/iter
>for MLU against 0.78 secs/iter for Mlucas at 512K FFT.
These timings suggest that it's more than a cache size issue - after
all, Mlucas has a smaller memory footprint irrespective of the CPU,
and one would expect some benefit from this even in cases where
both codes significantly exceed the L2 cache size. I wonder whether
the code itself (much larger than MLU's, due to all the routines
needed for the non-power-of-2 FFT lengths that MLU doesn't support)
might be causing a slowdown here by competing with the FFT data for
space in the L2 cache. I've little experience with this aspect of
performance, but perhaps conditional compilation, with each binary
incorporating only the routines it needs for its particular FFT
length, could reduce the code footprint and help performance - would
any of the computer science experts care to comment?
Or perhaps the compiler and OS already do a good job of keeping only
the needed program segments in cache, in which case the problem lies
elsewhere.
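To illustrate the conditional-compilation idea, here is a minimal C
sketch; the macro and routine names are invented for the example and
are not Mlucas's actual build options:

    #include <stdio.h>

    /* Hypothetical per-length build: compile with -DFFT_LEN_256K
       (power-of-2) or -DFFT_LEN_384K (non-power-of-2), and only the
       radix routines that length needs end up in the binary, keeping
       the code's cache footprint small. */

    static void radix16_pass(void) { puts("radix-16 pass"); }

    #ifdef FFT_LEN_384K
    /* Only non-power-of-2 lengths need the extra odd-radix routine. */
    static void radix3_pass(void)  { puts("radix-3 pass"); }
    #endif

    int main(void)
    {
    #if defined(FFT_LEN_256K)
        radix16_pass();            /* power-of-2: one routine suffices */
    #elif defined(FFT_LEN_384K)
        radix3_pass();             /* extra routine for the odd radix  */
        radix16_pass();
    #else
    #   error "Select a length: -DFFT_LEN_256K or -DFFT_LEN_384K"
    #endif
        return 0;
    }

One would then build one binary per FFT length, e.g.
"cc -DFFT_LEN_256K -o mlucas_256k ..." - whether the savings would
actually show up as fewer instruction-cache misses is exactly the
question I'd like an expert opinion on.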
One other possibility is that the SPARC f90 compiler, being *much*
younger than the (apparently very good) C compiler, has better
support for the newer v8 and v9 instruction sets, i.e. its authors
didn't put as much work into optimizations for the older CPUs - I
don't know.
We could also use help from any SPARC employees familiar with the
compiler technology to tell us why the f90 -xprefetch option is so
unpredictable - it speeds up some runlengths by as much as 10%, but
more often causes a 10-20% slowdown, or no change at all.
Bill also wrote:
>>>Mlucas runs significantly faster if you can compile and run it
>>>on a 64-bit Solaris 7 system.
I replied:
>>This is the first I've heard of this - roughly how much of a speedup
>>do you see?
Bill again:
>I'm a bit red-faced on this one. I just tried it again and it doesn't.
>This is still a mystery to me. It would seem to me that for this
>type of code, having full access to the 64-bit instruction set of
>the UltraSPARC CPUs and running on a 64-bit operating system would
>give you the best performance. But that doesn't seem to be the case.
Well, I didn't really expect it to make much difference, which is why
I expressed surprise when you said that it did. Was the slowdown you
mentioned for MLU in 64-bit mode also spurious?
Floating-point-dominated code like Mlucas shouldn't really benefit
much from 64-bit mode, except perhaps in comparison with the code
generated by the crappy f90 v1 compiler, which wouldn't emit 64-bit
loads/stores even if one asked for them in the compile flags.
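A trivial C sketch of why (not Mlucas code, just an illustration):
the floating-point data itself is 64 bits wide under either ABI, so
the 64-bit build mainly changes pointer size, not the FP work:

    #include <stdio.h>

    /* A double occupies 8 bytes whether the program is built with a
       32-bit or a 64-bit ABI; only the pointer size changes.  So an
       FFT that is almost entirely double loads/stores and arithmetic
       gains little from recompiling in 64-bit mode. */
    int main(void)
    {
        printf("sizeof(double) = %lu bytes\n", (unsigned long)sizeof(double)); /* 8 in both modes */
        printf("sizeof(void *) = %lu bytes\n", (unsigned long)sizeof(void *)); /* 4 vs. 8         */
        return 0;
    }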
In any event, Mlucas appears to perform very well on the newer SPARC
CPUs and decently well on the old ones (but there one may want to use
MLU at power-of-2 runlengths if it proves faster), so perhaps we
shouldn't get too greedy...I'm joking, of course - speed, give us
more speed!
-Ernst