Happy new year to all,
Current timings on a 512K FFT (exponents up to 10.32 million):
1.4GHz P4: 0.126 sec
1.1GHz T-bird: 0.103 sec
1.0GHz P3: 0.145 sec
As pointed out in an earlier email, using the new SSE2 instructions and
coding for the changed cache line size should allow much better timings
on the P4.
I coded the most common FFT building block, four_complex_fft, using the
SSE2 instructions. This macro takes 8 floating point values (4 complex
numbers) and does 2 "levels" of the FFT. The SSE2 implementation will
operate on two different sets of 8 floating point values and also do
2 levels of the FFT (i.e. it does exactly twice as much work).
Next I timed this macro using consecutive memory locations. The P3 version
of this macro takes 52 clocks. The P4 new implementation takes 48 clocks!
More than twice the throughput.
Now the bad news, when I use the prime95 memory layout where the 8 input
values come from 8 different cache lines and the modified values are
output to the same cache lines (an in-place FFT), the P4 code now takes 112
clocks.
The cause is the new 64-byte cache line. The L1 cache is write-back.
Any changes to the L1 cache are written back to the L2 cache. On the P3
the L1 and L2 cache line size is 32 bytes. A write to the L2 cache takes
7 clocks (I think). Thus, the P3 macro will run in 8*7=56 clocks. The
P4 has a 64 byte L1 cache. The update of the L2 cache is done with two
writes of 32 bytes each taking 7 clocks. Thus, the P4 macro takes 8*14=112
clocks.
After much thought (coming up with a scheme that does not cause thrashing
of the 64 TLBs is not easy), I think I've worked out a new memory layout
for prime95. This is a mostly-in-place FFT where the outputs of the macros
are written to consecutive memory locations. I've now begun the process
of recoding all of the prime95 building blocks, then I'll code up the
new memory layout for the P4. Obviously this will take a good deal of time.
I thought you might be interested in a rough estimate of how fast a 512K FFT
might run in the finished version. The same macro above when operating
out of the L2 cache is a little slower - call it 56 clocks. The macro
processes 16 doubles, or 56/16=3.5 clocks/double. The 512K FFT has 19
forward FFT levels and 19 inverse FFT levels. The first 3 levels are special
and are as expensive as 2 levels later on. Remember, the macro does 2 levels
meaning we will run it 9 times in the forward FFT and 9 times in the inverse
FFT. Total cost is 512K doubles * 3.5 clocks/double * 18 / 1.4G clocks/sec.
This is .024 sec. I've estimated the carry propagation step at .004 sec.
The FFT will also access main memory 3 times, much of this cost can be
hidden with judicious use of the PREFETCH instruction. This estimate also
does not include the cost of accessing the sin/cos data. All in all, I
think a time of 0.040 seconds should be achievable.
Having fun,
George
_________________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers