Yves Gallot wrote:

>if you compare a well-written C code on the Alpha (who write in assembler on
>this chip ?) with a code in assembler on the PII/400

Is that a fair comparison? The first implication, that writing assembly
code for Alpha is somehow intrinsically harder than for x86, is simply
not true. In fact, I would argue that x86, owing to its pitifully
small number of registers, almost forces one to write assembly code in
order to achieve decent performance. That's not to say an excellent
optimizing compiler written for x86 cannot do a good job, merely that
it's easier to do a good job (without hand-coding everything in assembly)
when one is not constantly having to play games to relieve register
allocation pressure. The other issue here is one of economics: on a high-
end machine with a good compiler, if I know I can get 80-90% of the
performance of a hand-tuned, labor-intensive assembly code while still
enjoying the benefits of programming in a high-level language, it will
rarely be worth the effort to do it in ASM. At worst, I might only want
to code one or two critical (and hopefully not too complicated) routines
in assembly.

In the case of George's Prime95, it paid to go to ASM because (a) on the
x86, one can expect a significant performance gain by doing so (perhaps
George can give us an idea of how large the gain was), and (b) there are
so many Intel-based machines out there, it's worth doing the extra work
it takes to squeeze every last percent of performance out of the processor.


>The MIPS R10000 is an "old" processor and is not fast today.

I disagree - it may be "old" but it's still pretty good. For typical
floating-point heavy HLL code I consistently find the R10000 runs about
twice as fast as an Alpha 21164 at the same clock rate, and since the
Alpha 21264 will certainly not run twice as fast as a 21164 (at the same
clock rate) for this kind of code, one can expect an R10000 to still
compare well to the 21264. That said, the best Alphas run at a higher
clock rate than the best R10000s, so one should expect a 600MHz
21264 to beat a 250MHz R10000.

Of course there are always certain kinds of computation where one processor
does much better than the other, sometimes in surprising fashion. A recent
example I encountered can serve to illustrate: I had written a Lehmer-type
GCD which ran very quickly on the Alpha, but made heavy use of the Alpha
UMULH (upper 64 bits of the product of two 64-bit unsigned ints) instruction.
In order to allow the code to run on architectures lacking a 128-bit integer
product capability, I also coded a version that did a floating-multiply-based
emulation of UMULH (this requires some additional integer and error correction
operations, but is not terribly complex.) The resulting code ran nearly as
quickly on a 21164 as the UMULH-based code, despite the fact that it had to
do roughly 20% more passes (the FP emulation of UMULH allows one to multiply
a 64-bit integer by one up to "only" 52 bits in length and still get the
correct result, so the scalar multipliers used in the Lehmer GCD are on
average 20% smaller, hence ~20% more passes are needed.)
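
For readers who want to see the general shape of such an emulation, here is
a minimal C sketch. It is not the code discussed above: the name umulh_fp is
hypothetical, and the sketch assumes IEEE double arithmetic with
round-to-nearest. It also restricts the second operand to less than 2^50,
a more conservative bound than the 52 bits mentioned above, so that the
simple error correction shown here is easy to justify.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not the author's code: high 64 bits of x*y,
 * obtained from a double-precision estimate plus an integer-based
 * correction. Requires y < 2^50 (a deliberately conservative bound). */
uint64_t umulh_fp(uint64_t x, uint64_t y)
{
    assert(y < (UINT64_C(1) << 50));

    /* The low 64 bits are exact: unsigned multiplication wraps mod 2^64. */
    uint64_t lo = x * y;

    /* q approximates x*y / 2^64. (double)y is exact; rounding x and
     * rounding the product each add at most 2^-53 relative error, so
     * q is within roughly 0.25 of the exact quotient. */
    double q = (double)x * (double)y * 0x1p-64;

    uint64_t hi = (uint64_t)q;               /* truncates; off by at most 1 */
    double est_frac = q - (double)hi;        /* fractional part of the estimate */
    double true_frac = (double)lo * 0x1p-64; /* fraction implied by the exact lo */

    /* est_frac - true_frac lies close to (true hi - hi): -1, 0 or +1. */
    double diff = est_frac - true_frac;
    if (diff > 0.5)
        hi += 1;
    else if (diff < -0.5)
        hi -= 1;
    return hi;
}
```

For example, umulh_fp(UINT64_MAX, 3) first estimates 3 (because UINT64_MAX
rounds up to 2^64 as a double) and the correction step brings it down to the
true value 2. Note the integer-to-double conversions of full 64-bit operands,
which matter for what follows.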

However, things got interesting when I ran the FP-based code on the R10000.
The reason I tried this is that although the MIPS has a 128-bit integer
multiply instruction (the DMULTU instruction) like the Alpha, unlike the
Alpha it's not as easy to inline into HLL code. So I went with the generic
FP version. Surprise - on the R10000 it absolutely crawled - it was thirty
times slower than I expected. After ruling out the obvious suspects, I contacted
Peter Montgomery, who is more familiar with the R10000 than I, about it.
Several days of detective work ensued. The first thing Peter did was to
write a small MIPS ASM code segment to supply the 128-bit DMULTU result.
The resulting all-integer code was 20 times faster than the FP version. The
relative difference in speeds was still unaccounted for. After some more
work, Peter found a huge disparity in how much time the R10000 was taking
to effect integer-->floating conversion, depending on whether the integer
operand had more than 51 bits or not. He then contacted Torbjörn Granlund
of the GMP project, who found the following skeleton in the R10000 designers'
closet:

Version 2.0 of October 10, 1996 MIPS R10000 Microprocessor User's Manual
312                                                           Chapter 15.

15.5 FPU Instructions
   This section describes the R10000 processor-specific implementations of
   the following FPU instructions:
      * CVT.L.fmt
      * moves and conditional moves
      * CFC1/CTC1

CVT.L.fmt
   The CVT.L.fmt instruction has a slight change in the R10000 processor
   implementation. The R4400 processor allows conversion from a single or a
   double up to a 53-bit long integer. If the result is greater than 53 bits
   after the conversion, an Unimplemented Operation exception is taken. A
   back-conversion from a 53-bit long integer to single/double also takes
   an Unimplemented Operation exception.

Errata
   In the R10000, the conversion allows only up to 51 bits; otherwise an
   Unimplemented Operation exception is taken. The back-conversion from a
   51-bit long integer to single/double no longer takes an Unimplemented
   Operation exception.


Meaning, the R10000 can't do conversions of integers >= 2^51 in hardware -
it emulates such conversions in software! One can understand why a designer
might be tempted to take such a shortcut (after all, we all know that no
one in their right mind tries to mix floating and integer operations in
their code, right?) but for a supposedly high-end (and assuredly expensive)
processor like the R10000, this sort of thing is rather shocking.
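
To make the constraint concrete, here is a minimal C sketch of one way to
convert a full 64-bit unsigned integer to double while only ever converting
operands far below 2^51. The helper name u64_to_double_split is hypothetical,
and this is not something described in the thread above. Whether a given
compiler's output for it actually avoids the software-emulated path on an
R10000 is an assumption that has not been checked on that hardware.

```c
#include <stdint.h>

/* Hypothetical helper: convert a 64-bit unsigned integer to double using
 * only integer-to-double conversions of values below 2^32. */
double u64_to_double_split(uint64_t x)
{
    uint64_t hi = x >> 32;                  /* upper 32 bits, < 2^32 */
    uint64_t lo = x & UINT64_C(0xFFFFFFFF); /* lower 32 bits, < 2^32 */

    /* hi * 2^32 and lo are both exactly representable as doubles, so the
     * single addition rounds once, just as a direct conversion would
     * (assuming the default round-to-nearest mode). */
    return (double)hi * 4294967296.0 + (double)lo;
}
```

The cost is an extra conversion, multiply and add per operand, which is
cheap next to a trap into software emulation.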

I note that the R10000 is not the only high-end CPU that has "dirty little
secrets" like this: for example, the Alpha 21164 has two 64-bit integer
units, which makes one think it should really scream at integer arithmetic.
And it does, in many cases, unless one has an integer-multiply-heavy code.
That runs surprisingly slowly ("slow" of course being a relative term here),
because only one of those integer units can multiply, and on that unit,
multiplies are essentially unpipelined - the 21164 can start a new IMUL
at best every eight clock cycles. Thankfully, the 21264 will fully
pipeline IMULs.

My point is that one needs to be very careful in defining "fast": is one
talking about HLL or ASM code? Floating-point or integer-dominated? Nice
mix of operations or predominantly one type of operation? Register-greedy
or not? Large memory footprint or small? Distributable computation, or
needing lots of interprocessor communication? & cetera ...

Regards,
Ernst
