Re: Mersenne: Re: cpu years/day vs GFlops/sec

2000-01-09 Thread George Woltman

Hi,

At 06:50 PM 1/6/00 EST, [EMAIL PROTECTED] wrote:
Spike Jones wrote:

Processor gurus, please:  using the equivalence that is suggested
by the primenet status page [86.6 P90 CPU yr/day = 1042 GFlops]
I calculate that a floating point operation must be about 3 CPU cycles.

Indeed, I calculate ~0.4 FLOP/cycle, which at first glance seems about a
factor of 2 too slow, even on the humble P90 (which should, assuming no
cache misses, be able to dispatch one FADD per cycle and (I believe - x86
experts, please correct me if I'm wrong) one FMUL every other cycle, for
a peak throughput of 1.5 FLOP/cycle.

The humble P90 can only do one FADD *OR* FMUL per cycle.  Thus, maximum
throughput is 1.0 FLOP/cycle.  Worse yet a floating point load takes one
clock and a store takes two clocks.  With only 8 registers there are a
lot of loads and stores.  The PPro, P-II, P-III, and Celeron have
a better architecture that allows loads and stores to run in parallel
with the FADDs and FMULs.  This makes it easier to approach the 1.0
FLOP/cycle theoretical maximum.

Regards,
George

_
Unsubscribe  list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ  -- http://www.tasam.com/~lrwiman/FAQ-mers



Mersenne: Re: cpu years/day vs GFlops/sec

2000-01-09 Thread Colin Percival

On Thu, 6 Jan 2000 18:50:41 EST, [EMAIL PROTECTED] wrote:
Indeed, I calculate ~0.4 FLOP/cycle, which at first glance seems about a
factor of 2 too slow, even on the humble P90 (which should, assuming no
cache misses, be able to dispatch one FADD per cycle and (I believe - x86
experts, please correct me if I'm wrong) one FMUL every other cycle, for
a peak throughput of 1.5 FLOP/cycle.

  Almost.  Although the FADD unit can start an instruction every cycle and
the FMUL unit every second cycle, they share some overhead (eg, decode and
control logic), such that only one of the two can be started in any given
cycle.  This is true of both P5 and P6 architectures.

Colin Percival

_
Unsubscribe  list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ  -- http://www.tasam.com/~lrwiman/FAQ-mers



Mersenne: Re: cpu years/day vs GFlops/sec

2000-01-06 Thread EWMAYER

Spike Jones wrote (Hi, Spike! Live long and prosper...)

Processor gurus, please:  using the equivalence that is suggested
by the primenet status page [86.6 P90 CPU yr/day = 1042 GFlops]
I calculate that a floating point operation must be about 3 CPU cycles.

Indeed, I calculate ~0.4 FLOP/cycle, which at first glance seems about a
factor of 2 too slow, even on the humble P90 (which should, assuming no
cache misses, be able to dispatch one FADD per cycle and (I believe - x86
experts, please correct me if I'm wrong) one FMUL every other cycle, for
a peak throughput of 1.5 FLOP/cycle. As a second ballpark-style check,
I did a similar calculation for Mlucas 2.7z running on a 500MHz Alpha
21264, which can in theory dispatch 2 FLOP (1 FADD and 1 FMUL) per cycle.
A 256K (real-element) DWT-based FFT (gory details of the calculation are
appended below) gets around 0.66 FLOP/cycle on the 21264. Thus, Prime95
seems to average around 27% of the peak theoretical throughput on the
Pentium (probably somewhat better on the PII and PIII) and Mlucas gets
around 33% of the peak throughput of the 21264.

Much of the difference (theoretical vs. actual throughput) is due to
the real-world overhead of servicing cache misses. (Note that on the
21264, a 256K array of doubles plus some smaller arrays for DWT weights
and FFT sincos data nearly fit in the 4 MB L2 cache, thus explaining
much of the performance gain vs. Prime95, but one still has L1 cache
misses to service.)

Another source of the discrepancy (at least for Mlucas) is the code's
mix of instructions, which tend to have around a 60/40 mix of FADD/FMUL,
thus ensuring that on the Alpha (even in the absence of cache misses)
the floating multiplier is idle at least a third of the time.

Lastly, all of these estimates neglect loads and stores in the FLOP count,
i.e. underestimate the true FLOP count by as much as a factor of 1.5 to 2.
That means that the CPU is really much busier than one expects simply
based on a count of the arithmetic operations.

Cheers,
-Ernst

(Notes: real vector length 256K = 2^18 means complex length 2^17, for
which the Mlucas FFT algorithm uses complex radices 8,8,8,16,16.
A Radix-8  pass, with twiddles, needs 2.75 fmul, 5.0 fadd per real input.
A Radix-16 pass, with twiddles, needs 5.00 fmul, 6.5 fadd per real input.
Assuming complex twiddle multiplies are needed for each pass (in reality
they are not needed for the first pass of the FFT) slightly overestimates
the FLOP count, so neglect the extra ops needed for the real==complex
wrapper/square steps needed between the forward and inverse FFT, which
cost about the same per input as a pair of twiddle multiplies. Thus,
the above combination of radices needs 3*2.75+2*5.00 = 18.25 fmul and
3*5.0+2*6.5 = 28.0 fadd per real input per FFT. We do two FFTs per
iteration, hence need 36.5 fmul and 56 fadd per input element, or about
83 FLOP per real input. Add another 15-16 FLOP per real element during
the carry propagation phase, and we get around 100 FLOP per real vector
element per iteration. Multiply by 2^18 real elements to get the total
FLOP count per iteration, divide by .079 seconds per iteration on a 500MHz
21264 and divide by 5 cycles per second to get ~0.66 FLOP/cycle.)

_
Unsubscribe  list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ  -- http://www.tasam.com/~lrwiman/FAQ-mers