> I took 1 graduate level course in Chip design in college ( this makes me an 
> expert right? ;-).
> Back then the K6 had 3 "floating point pipes?" where the pentium only had 1 
> ( this was several beers ago so bear with me).
> So on each cycle the K6 could potentially do 3 times the work.
> Problem was none of the code of the time was written to take advantage of 
> the additional (pipes)?  (Please correct me if I am using the wrong 
> terminology.)

The K6 had a hopelessly outclassed FPU--it was an evolved form of what they
used when they had a license to make 486s--and performed little better.  You
are thinking of the K7 core--mostly similar to the K8.  You are correct
in that software designed for a P3 will not take full advantage of the K7
core.  Things get a lot more complex when the P4 is thrown into the picture.

> The below statement makes me wonder if Intel really has a better platform 
> for Floating point operations than AMD or if the code has still not been 
> totally optimized to utilize functionality in AMD Opterons?

Pay close attention to which Intel chips were mentioned--the Itanium.  It's
the $4k/pop server chip.  For many tasks, it's very unimpressive, but for
single-threaded FP, it's a monster.  Dollar for dollar, you can beat it with
about anything, but some people sometimes need one program to go really fast.
Mostly, these are tasks that aren't 'embarrassingly parallel', such as LL
testing.
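
For anyone new to the list, here is a minimal sketch of the LL test in
Python, just to show why it is hard to spread across CPUs: each step needs
the result of the step before it.  This is only an illustration, not how
Prime95 does it.  The real code does the squaring with hand-tuned FFT
multiplication, and the function name here is my own.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if 2**p - 1 is prime.  Assumes p is itself prime."""
    if p == 2:
        return True  # 2**2 - 1 = 3 is prime; the loop below needs p > 2
    m = (1 << p) - 1  # the Mersenne number being tested
    s = 4
    # p - 2 squarings, each one depending on the previous result,
    # so the loop cannot simply be split across several processors.
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    print([p for p in (3, 5, 7, 11, 13, 17, 19, 23, 31) if lucas_lehmer(p)])
```

For an exponent near 33,900,000 that loop runs about 33.9 million times,
which is why the per-iteration time matters so much.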

> I had always heard that AMD was better for FP performance ( I  prefer AMD so 
> I am biased ).
> AMD seems to continually lag behind Intel on memory bandwidth and bus 
> speeds.

To some extent, this has been true since the K7 core came out and gave
Intel's P3 a wake-up call.  But, the P4 works very well in certain tasks--if
you optimize it extensively and it fits the structures available on the chip.
The Itanium is similar, but is more general purpose and much more expensive.

> Was wondering if anyone was willing to open floor to a discussion on which 
> tests the industry measures Floating point performance and which of these 
> tests would most closely relate to running Prime95.  Or perhaps in this case 
> the memory benchmarks are equally important.

Take a look at the STREAM benchmark for a good benchmark in this area.
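
STREAM itself is a C/Fortran program.  As a very rough stand-in, the sketch
below times a plain memory copy in Python, similar in spirit to STREAM's
Copy kernel.  The function name, buffer size, and repeat count are my own
assumptions, and the figure it reports is not comparable to real STREAM
results.

```python
import time


def copy_bandwidth_mb_s(n_bytes: int = 64 * 1024 * 1024, repeats: int = 5) -> float:
    """Estimate memory copy bandwidth in MB/s (best of several runs)."""
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment: one bulk copy
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)  # guard against a zero reading from the timer
    # Count both the bytes read and the bytes written, as STREAM does.
    return 2 * n_bytes / best / 1e6


if __name__ == "__main__":
    print(f"copy: {copy_bandwidth_mb_s():.0f} MB/s")
```

The buffers need to be much larger than the caches, otherwise you end up
measuring cache speed rather than main memory.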

> Of course this could open a whole can of worms as to whether these tests are 
> really "optimized" for the processor / platform?

SPEC, by rule, is not 'optimized' in the way that Prime95 and various other
codes are.  IIRC, the rules still only allow you to use compiler switches on
a shipping or soon to be shipping compiler.  You're not allowed to go in and
hand code things in assembly.  In reality, serious chunks of code do get
optimized that way.

> Also was curious if there was any optimizing done in the code (Prime95) to 
> take advantage of an Athlon 64 or an Opteron feature and if not would it be 
> possible to "port" the code to utilize any underutilized abilities?

This has been discussed in the past every time a new chip comes out.  Some
enthusiast (hello :) ) comes around and asks if new feature X will be a killer
for LL testing.  Generally no, but some are nice.  The list of features that
I can remember being great were on-chip L2 (late PII/all PIII), better FSB
(K7 and P4), SSE2 as implemented on the P4, and prefetching.  I'm not sure 
about SSE2 on the Opterons.  George would be the best to comment on what 
his wish list would be.

> I currently have an Athlon 64 3200 with 1 GB of PC3200 ram.
> I am currently getting .118 iteration time on a 33900000 number.
> I see on our benchmark page that some of the P4s were getting 0.053 in the 
> same range.
> The industry benchmarks don't show Intel at 2x the performance of AMD so wondering 
> what is at play here.
> I imagine the old von Neumann bottleneck is hurting AMD here and that it is 
> really an issue more with  memory bandwidth.

SSE2 and the almost 2x higher FSB bandwidth, most likely.  Oh, the higher core
clock doesn't hurt.  This is one reason why Opterons are getting better, but
aren't there, yet.  They have better FSB bandwidth (and much better latency),
but I vaguely remember that their SSE2 implementation wasn't anything to write
home about--can anyone comment on this more precisely?  Oh, and the core is 
still ~60% of a top end P4.

> Would be interesting to see what results would be if the same memory 
> bandwidth  was held the same ( throttle back Intel ) to see what the effect 
> has on iteration times.

If I had a high end P4 system that I could test with, I'd run that for you.
I think I would find it informative, as well.

> Too bad we couldn't do the iterations entirely from the CPU registers. ;-)

That is one of the reasons that the Itanium is so much faster.  Instead of 8
or 16 FPU registers, it has 128.  That allows it to do a higher radix FFT
which means less I/O is needed per pass as well as fewer operations.  And,
IIRC, a better mix of add to multiply.  I think it's unlikely to find a chip
that physically has enough registers to hold an LL test in them.  Maybe, if
someone dropped a few billion in cash to Cray....
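
To put rough numbers on the radix point, here is a small Python sketch.
It assumes a plain power-of-two FFT where a radix-r butterfly works on r
complex values at once (so 2*r real registers just for the data, before
twiddles and temporaries).  The FFT length of 2**21 is only a hypothetical
example, not what Prime95 uses for any particular exponent.

```python
def passes_needed(fft_length: int, radix: int) -> int:
    """Number of passes over the data for a given FFT length and radix."""
    passes = 0
    covered = 1
    while covered < fft_length:
        covered *= radix
        passes += 1
    return passes


def data_registers(radix: int) -> int:
    """Real registers needed to hold one butterfly's complex inputs."""
    return 2 * radix


if __name__ == "__main__":
    n = 2 ** 21  # hypothetical FFT length, for illustration only
    for radix in (2, 4, 8, 16):
        print(radix, passes_needed(n, radix), data_registers(radix))
```

With that length, radix 2 takes 21 passes while radix 16 takes 6, but
radix 16 wants 32 registers for the data alone.  That is out of reach with
8 or 16 registers and comfortable with 128.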

> This was not intended to be a novel nor the start of the processor wars but 
> here goes.
> 
> Instructions:
> 1. Pull Pin
> 2. Throw

3. Duck.

Remember, kids, once you pull the pin, Mr. Grenade is no longer your friend.

Cheers,
David