> I took 1 graduate level course in Chip design in college ( this makes me an
> expert right? ;-).
> Back then the K6 had 3 "floating point pipes?" where the pentium only had 1
> ( this was several beers ago so bare with me).
> So on each cycle the K6 could potentially do 3 times the work.
> Problem was none of the code of the time was written to take advantage of
> the additional (pipes)? (Please correct me if I am using the wrong
> terminology.)

The K6 had a hopelessly outclassed FPU--it was an evolved form of what they
used when they had a license to make 486s--and performed little better. You
are thinking of the K7 core, which is mostly similar to the K8. You are
correct that software designed for a P3 will not take full advantage of the
K7 core. Things get a lot more complex when the P4 is thrown into the picture.

> The below statement makes me wonder if Intel really has a better platform
> for Floating point operations than AMD or if the code has still not been
> totally optimized to utilize functionality in AMD Opterons?

Pay close attention to which Intel chip was mentioned--the Itanium. It's the
$4k/pop server chip. For many tasks, it's very unimpressive, but for single
threaded FP, it's a monster. Dollar for dollar, you can beat it with about
anything, but some people sometimes need one program to go really fast.
Mostly, these are tasks that aren't 'embarrassingly parallel', such as LL
testing.

> I had always heard that AMD was better for FP performance ( I prefer AMD so
> I am biased ).
> AMD seems to continually lag behind Intel on memory bandwidth and bus
> speeds.

To some extent, this has been true since the K7 core came out and gave
Intel's P3 a wake up call. But the P4 works very well on certain tasks, if
you optimize for it extensively and the task fits the structures available on
the chip. The Itanium is similar, but is more general purpose and much more
expensive.

> Was wondering if anyone was willing to open floor to a discussion on which
> tests the industry measures Floating point performance and which of these
> tests would most closely relate to running Prime95. Or perhaps in this case
> the memory benchmarks are equally important.

Take a look at STREAM for a good memory bandwidth benchmark.

> Of course this could open a whole can of wurms as to whether these tests are
> really "optimized" for the processor / platform?

SPEC, by rule, is not 'optimized' in the way that prime95 and various other
codes are. IIRC, the rules still only allow you to use compiler switches on a
shipping or soon to be shipping compiler. You're not allowed to go in and hand
code things in assembly. In real-world codes, serious chunks do get optimized
that way.

> Also was curious if there was any optimizing done in the code (Prime95) to
> take advantage of an Athlon 64 or an Opteron feature and if not would it be
> possible to "port" the code to utilize any underutilized abilities?

This gets discussed every time a new chip comes out. Some enthusiast
(hello :) ) comes around and asks if new feature X will be a killer for LL
testing. Generally no, but some are nice. The features that I can remember
being great were on chip L2 (late PII/all PIII), better FSB (K7 and P4), SSE2
as implemented on the P4, and prefetching. I'm not sure about SSE2 on the
Opterons. George would be the best to comment on what his wish list would be.

> I currently have an Athlon 64 3200 with 1 GB of PC3200 ram.
> I am currently getting .118 iteration time on a 33900000 number.
> I see on our benchmark page that some of the P4s were getting 0.053 in the
> same range.
> The industry benchmarks dont show Intel x2 performance of AMD so wondering
> what is at play here.
> I imagine the old Van Newman bottleneck is hurting AMD here and that it is
> really an issue more with memory bandwidth.

SSE2 and the almost 2x higher FSB bandwidth, most likely. Oh, and the higher
core clock doesn't hurt. This is one reason why Opterons are getting better,
but aren't there yet. They have better FSB bandwidth (and much better
latency), but I vaguely remember that their SSE2 implementation wasn't
anything to write home about--can anyone comment on this more precisely? Oh,
and the core clock is still ~60% of a top end P4's.

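To put those iteration times in perspective, here is a minimal Python sketch of the LL test and of what a per-iteration time means for a whole test. It assumes the "33900000 number" above is the exponent p, and the function names are made up for illustration. This is not how Prime95 does it: Prime95 does the squaring with hand-tuned FFT code, while this sketch leans on Python's built-in big integers and is only practical for small exponents.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if 2**p - 1 is prime, for an odd prime exponent p.

    Each step needs the previous value of s, so the p - 2 iterations
    cannot be spread across machines; one test runs on one CPU.
    """
    m = (1 << p) - 1          # the Mersenne number being tested
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m   # one "iteration": square, subtract 2, reduce
    return s == 0


def days_for_test(p: int, seconds_per_iteration: float) -> float:
    """Estimate wall-clock days for a full test (hypothetical helper)."""
    return (p - 2) * seconds_per_iteration / 86400.0


if __name__ == "__main__":
    # Small sanity checks: 2**7 - 1 = 127 is prime, 2**11 - 1 = 2047 is not.
    print(lucas_lehmer(7), lucas_lehmer(11))

    # Iteration times quoted above, assuming 33900000 is the exponent.
    for label, t in (("Athlon 64 3200", 0.118), ("P4", 0.053)):
        print(f"{label}: {days_for_test(33_900_000, t):.1f} days")
    print(f"ratio: {0.118 / 0.053:.2f}x")
```

With the quoted numbers this works out to roughly 46 days versus roughly 21 days for one exponent, a bit over a 2.2x gap.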
> Would be interesting to see what results would be if the same memory
> bandwidth was held the same ( throttle back Intel ) to see what the effect
> has on iteration times.

If I had a high end P4 system that I could test with, I'd run that for you. I
think I would find it informative, as well.

> To bad we couldnt do the iterations entirely from the CPU registers. ;-)

That is one of the reasons that the Itanium is so much faster. Instead of 8 or
16 FPU registers, it has 128. That allows it to do a higher radix FFT, which
means less I/O is needed per pass as well as fewer operations. And, IIRC, it
gives a better mix of adds to multiplies. I think it's unlikely that we'll
find a chip that physically has enough registers to hold an LL test in them.
Maybe, if someone dropped a few billion in cash on Cray....

> This was not intended to be a novel nor the start of the processor wars but
> he goes.
>
> Instructions:
> 1. Pull Pin
> 2. Throw
> 3. Duck.

Remember, kids, once you pull the pin, Mr. Grenade is no longer your friend.

Cheers,
David
_______________________________________________
Prime mailing list
[EMAIL PROTECTED]
http://hogranch.com/mailman/listinfo/prime
