Yves Gallot wrote:
> The PII is able to shift two numbers by n positions in one step. The PII is
> a Post-RISC processor, ie it is able to execute 2 or 3 instructions per
> cycle and each basic instruction takes 1 cycle. The CISC and RISC
> architectures are both obsolete.
Re. Yves' comment, I suppose that depends on how one defines RISC.
I see no incompatibility between a streamlined instruction set and
a CPU that can issue/execute multiple instructions per clock.
>each basic instruction takes 1 cycle.
Ooh, we need to be careful here - perhaps this is true for the PII,
but on the Alpha (all models), for example, many instructions - notably
the integer multiplies (IMULs), both high and low - take many more
cycles to complete. Thus one must take pipelining into account as well.
Here's an example: the Alpha 21164 has long latencies for integer
multiplies. What is more, there are added latencies, depending on
the source of the data, and latencies from the start of one multiply
until the next can issue (i.e. IMULs are only partially pipelined).
I quote from the 21164 Hardware Reference, Table 2-9, Instruction
latencies:
{Note:
IMULL = low 32 bits of a product of two 32-bit ints
IMULQ = full 64 bits of a product of two 32-bit ints, or low 64 bits
of a product of two 64-bit ints
IMULH = high 64 bits of a product of two 64-bit ints}
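To make the three flavors concrete, here is a small sketch in C of what
each instruction computes (using gcc's unsigned __int128 extension to
stand in for the hardware's wide product - an assumption of mine, not
anything from the 21164 manual):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* IMULL: low 32 bits of a 32x32-bit product */
uint32_t imull(uint32_t a, uint32_t b) { return a * b; }

/* IMULQ: low 64 bits of a 64x64-bit product (which is also the full
   64-bit product when both operands fit in 32 bits) */
uint64_t imulq(uint64_t a, uint64_t b) { return a * b; }

/* IMULH: high 64 bits of a 64x64-bit product */
uint64_t imulh(uint64_t a, uint64_t b)
{
    return (uint64_t)(((u128)a * b) >> 64);
}
```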
                                                      Additional time before
                                                      result available to
Class   Latency                                       integer multiply unit
---------------------------------------------------------------------------
IMULL   Latency = 8, plus up to 2 cycles of added     1 cycle
        latency, depending on the source of the
        data. Latency until the next IMULL, IMULQ
        or IMULH instruction can issue (if there
        are no data dependencies) is 4 cycles plus
        the number of cycles added to the latency.

IMULQ   Latency = 12, plus up to 2 cycles of added    1 cycle
        latency, depending on the source of the
        data. Latency until the next IMULL, IMULQ
        or IMULH instruction can issue (if there
        are no data dependencies) is 8 cycles plus
        the number of cycles added to the latency.

IMULH   Latency = 14, plus up to 2 cycles of added    1 cycle
        latency, depending on the source of the
        data. Latency until the next IMULL, IMULQ
        or IMULH instruction can issue (if there
        are no data dependencies) is 8 cycles plus
        the number of cycles added to the latency.
Thus, e.g. for an IMULH instruction we may have to wait up to
16 cycles (!) for the result to become available, and in any event
(i.e. even with independent data) must wait at least 8 cycles before
the next IMUL of any flavor can issue.
These numbers explain why, e.g., Peter Montgomery's Mfactor code,
well-written though it is, is not dramatically faster on the 21164
than the Prime95 sieve-based factorer on a Pentium of similar cycle time:
In Mfactor, each modular multiply x*y mod q (the factoring is based
on these) needs 2 MULQs and 2 MULHs, with only a few intervening
low-latency operations, so the CPU spends a significant fraction of
its time idling, waiting for the results of IMULs to become available,
or waiting to be able to start the next IMUL. Prime95 uses the 64-bit
significand of Pentium floating registers (this extra-long mantissa
is a peculiarity of the x86 series, which happens to prove exceedingly
useful in this application) to store integer operands during factoring,
and thus can use low-latency, fully pipelined floating-point operations
(adds, subtracts and multiplies) to effect x*y mod q.
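For the curious: one standard way to get x*y mod q out of exactly two
low and two high 64-bit multiplies is Montgomery reduction. The sketch
below shows that general technique - it is my illustration, not a claim
about Mfactor's actual code - in C, with gcc's unsigned __int128
standing in for each MULQ/MULH pair, for odd q below 2^63:

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Precompute -q^-1 mod 2^64 for odd q, via Newton iteration:
   each pass doubles the number of correct low-order bits. */
uint64_t neg_qinv(uint64_t q)
{
    uint64_t x = q;               /* any odd q is its own inverse mod 8 */
    for (int i = 0; i < 5; i++)   /* 3 -> 6 -> 12 -> 24 -> 48 -> 96 bits */
        x *= 2 - q * x;
    return 0 - x;                 /* negate to get -q^-1 mod 2^64 */
}

/* Montgomery multiply: returns x*y*2^-64 mod q, for x, y < q < 2^63.
   Each u128 product below costs a MULQ + MULH on the Alpha. */
uint64_t mont_mul(uint64_t x, uint64_t y, uint64_t q, uint64_t qinv)
{
    u128 t     = (u128)x * y;             /* MULQ + MULH */
    uint64_t m = (uint64_t)t * qinv;      /* MULQ only (low half) */
    u128 mq    = (u128)m * q;             /* MULQ + MULH */
    /* t + mq is divisible by 2^64 by construction, so just add the
       high halves, with a carry whenever the low halves are nonzero. */
    uint64_t r = (uint64_t)(t >> 64) + (uint64_t)(mq >> 64)
               + ((uint64_t)t != 0);
    return r >= q ? r - q : r;
}
```

Note the dependency chain: each of the two wide products needs the
previous result, which is exactly why the 21164's 8-cycle issue gap
between IMULs hurts so much here.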
In an effort to increase the pipelineability of the calculation of
the full product of two 64-bit integers on the Alpha (21064 and 21164),
Peter Montgomery and I came up with a scheme that uses the floating-
point multiplier (and some error-correcting operations, needed to
cure round-off errors in the FMUL result) to replace the MULH instruction.
This scheme proves significantly faster than the obvious MULQ/MULH pair,
even though the operation count shoots up from 2 to a whopping 13. That
is because the FP/MULH scheme uses just one high-latency IMUL, right at
the start, plus 7 additional low-latency integer instructions (mainly
shifts and adds), two FMULs (4-cycle latency, fully pipelined) and three
floating<->integer conversions, which turn out to be freebies, since on
the Alpha the conversions are done in the floating adder, which is
otherwise unoccupied. A classic example of simple opcounts being highly
misleading.
Alas, the above scheme is not terribly useful for factoring, since it
only works for operands of 58 bits or less.
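The actual scheme is Alpha-specific, but the flavor of it - a fast
floating-point multiply giving an approximate answer, followed by cheap
integer operations to cure the round-off error - can be sketched
portably. The routine below is my own hypothetical illustration (not
the Montgomery/Mayer code) of x*y mod q done this way; for simplicity
it assumes x, y < q < 2^52, comfortably inside the 58-bit limit:

```c
#include <stdint.h>

/* Hypothetical sketch: x*y mod q via a floating-point estimate of the
   quotient plus an integer error-correcting step. Requires
   x, y < q < 2^52, so the FP estimate of floor(x*y/q) is off by at
   most a few units and the fix-up loops below run at most a few times. */
uint64_t mulmod_fp(uint64_t x, uint64_t y, uint64_t q)
{
    double qinv = 1.0 / (double)q;              /* precomputable */
    /* approximate quotient floor(x*y/q), via two fast FMULs */
    uint64_t h = (uint64_t)((double)x * (double)y * qinv);
    /* exact low 64 bits of x*y - h*q: the true remainder (plus a
       small multiple of q) is tiny, so it survives the wraparound
       mod 2^64 intact and fits in a signed 64-bit value */
    int64_t r = (int64_t)(x * y - h * q);
    while (r < 0)           r += (int64_t)q;    /* error correction */
    while (r >= (int64_t)q) r -= (int64_t)q;
    return (uint64_t)r;
}
```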
The Alpha also has quite a few in-register byte-manipulation
instructions, and one is tempted to replace, e.g., a shift by a
multiple of 8 bits done in the barrel shifter with one of these, but
here one runs into the problem that the Alpha can only issue 1 or 2
integer instructions per cycle (depending on the particular CPU and
the code being run), including the in-register instructions. Thus,
unless moving a shift out of the shifter frees up the shifter to do
something else, and that something else doesn't interfere with another
instruction that is trying to execute (that's a lot of ifs), no gain
is to be had.
Another example of a high-latency operation (on Pentiums and most RISCs)
is floating-point divide. The 21164 Hardware Reference lists a latency
of 22 to 60 cycles for an 8-byte floating divide - even though the Alpha
FDIV happens to be fully pipelined, the monstrous latency makes it really
hard for the compiler/ASM programmer to write good code taking account
of latencies. For this reason, you'll find few or no floating divides
in any of the faster Lucas-Lehmer codes: one either avoids divides
altogether (if possible), or hopes they are divides by constants, in
which case one precomputes the inverse and uses floating multiplies by
that (which are fast) to effect the divide.
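As a minimal illustration of that last trick (my sketch, not from any
actual Lucas-Lehmer code): dividing many values by the same constant c
costs one up-front divide, after which everything is cheap, pipelined
multiplies. Note that a*(1.0/c) may differ from a/c in the last bit,
which is usually acceptable in this setting:

```c
#include <stddef.h>

/* Replace n slow divides by a constant c with 1 divide + n multiplies. */
void scale_by_constant(const double *a, double *out, size_t n, double c)
{
    const double cinv = 1.0 / c;     /* the only (high-latency) divide */
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * cinv;        /* fast, fully pipelined FMULs */
}
```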
Bojan Antonovic wrote:
>PS: Have you read about the architecture of the Alpha 21264? It's interesting!
As far as integer operations are concerned, there's one huge change
from the 21164 to the 21264 - the 21264 supposedly can issue one IMUL
every cycle (i.e. IMULs are fully pipelined), rather than one every
eight cycles.
For this reason, IMUL-based factorers should really scream on the 21264.
The 21264 is able to issue up to 4 instructions (2 integer, 2 floating)
per cycle, has a mammoth number of registers (80 floating, of which
however only 32 are available to the ASM programmer - the others are
reserved or used for speculative pre-execution), and uses out-of-order
execution (along with those extra FP registers) to try to speed things up.
Through my participation in the SPEC98 benchmark search, I have been able
to get some timings of my LL code on the 21264, but I'm not allowed
to publicize them, since the 21264 is still in final development, and
hence actual performance data are proprietary. I'll say only that it
will be an impressive performer, but don't expect miracles, especially
for floating-heavy code - a maximum of 2 floating-point instructions
per cycle limits the attainable throughput, though of course the
reduced cycle times the 21264 will eventually achieve will speed
things up further.
Regards,
Ernst