Petter Urkedal wrote:
On 2007-08-13, Mark wrote:
Petter Urkedal wrote:
If we want to do a compromise, we could instead implement 32x16->32
multiply.  That is, two multipliers in the ALU stage, and an adder in
the IO stage.  If again we incorporate the shifts, we are down to 4
instructions to compute a 32x32->32 product:
mul_32x32_from_32x16:
        mul/h   r0, r1, r3      ; r3 := r0 * r1[31:16]
        mul/l   r0, r1, r2      ; r2 := r0 * r1[15:0]
        shift   r3, 16, r3      ; r3 := r3 << 16
        add     r2, r3, r2      ; r2 := r2 + r3
Note that register forwarding does not work fully for the mul
instruction in this case, since it's split over two stages.  There is a
1 cycle delay before we can use the result, which means this is the only
way to order the instructions.
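
As a sanity check on the arithmetic only (not the pipeline timing), the four-instruction sequence above can be modelled in a few lines of Python. This is a sketch: the helper name `mul_32x16` and the assumption that every result is truncated to 32 bits are illustrative, not part of the proposed ISA.

```python
MASK32 = 0xFFFFFFFF

def mul_32x16(a, b16):
    """Hypothetical model of a 32x16->32 multiply: low 32 bits of a * b16."""
    return (a * (b16 & 0xFFFF)) & MASK32

def mul_32x32_from_32x16(r0, r1):
    """Mirror the four-instruction sequence, one statement per instruction."""
    r3 = mul_32x16(r0, r1 >> 16)       # mul/h: r3 := r0 * r1[31:16]
    r2 = mul_32x16(r0, r1 & 0xFFFF)    # mul/l: r2 := r0 * r1[15:0]
    r3 = (r3 << 16) & MASK32           # shift: r3 := r3 << 16
    r2 = (r2 + r3) & MASK32            # add:   r2 := r2 + r3
    return r2

if __name__ == "__main__":
    a, b = 0x12345678, 0x9ABCDEF0
    # The result should equal the low 32 bits of the full product.
    print(mul_32x32_from_32x16(a, b) == (a * b) & MASK32)
```

The decomposition works because r1 = r1[31:16] * 2^16 + r1[15:0], and truncating each partial product to 32 bits before the shift and add does not change the low 32 bits of the sum.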
My guess is that the 16x16->32 multiplier with shifts on both the second
operand and the result is much cheaper than the extra adder and
multiplier of the 32x16->32 solution, and we save only one instruction
by going to 32x16->32.
How about if the shift was implicit in mul/h? That should be cheap in terms of hardware and it would decrease the cost of the soft 32x32 multiply to three cycles -- wouldn't it? (Sorry -- I have yet to read up on your architecture in detail.)

That's what I did in the 16x16->32 case, but in this case, the two-stage
mul/l instruction will not have a result ready at the point of the shift
instruction, so we can't save that cycle anyway.

The benefit, in my mind, is that you get a slot in which you can schedule something else.

ALU                     IO
issue mulh              -
issue mull              mulh completes
<free>                  mull completes
issue add               <free>
-                       add ready
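
The table above can be reproduced with a toy in-order, single-issue model. This is only a sketch under assumed latencies: a multiply result is taken to be usable two cycles after issue (the 1 cycle delay mentioned earlier in the thread), and shift and add results one cycle after issue. The function `schedule` and the `LATENCY` table are made up for illustration and say nothing about the real pipeline.

```python
def schedule(program, latency):
    """Return the issue cycle of each instruction, in program order.

    program: list of (opcode, source_registers, destination_register).
    latency: opcode -> cycles from issue until a dependent op may issue.
    """
    ready = {}          # register -> first cycle its value can be consumed
    cycle = 0           # earliest cycle the next instruction may issue
    issue_cycles = []
    for op, srcs, dst in program:
        start = max([cycle] + [ready.get(r, 0) for r in srcs])
        issue_cycles.append(start)
        ready[dst] = start + latency[op]
        cycle = start + 1
    return issue_cycles

# Assumed latencies, see the note above.
LATENCY = {"mulh": 2, "mull": 2, "shift": 1, "add": 1}

explicit_shift = [
    ("mulh", ["r0", "r1"], "r3"),
    ("mull", ["r0", "r1"], "r2"),
    ("shift", ["r3"], "r3"),
    ("add", ["r2", "r3"], "r2"),
]
implicit_shift = [
    ("mulh", ["r0", "r1"], "r3"),
    ("mull", ["r0", "r1"], "r2"),
    ("add", ["r2", "r3"], "r2"),
]

if __name__ == "__main__":
    print(schedule(explicit_shift, LATENCY))   # [0, 1, 2, 3]
    print(schedule(implicit_shift, LATENCY))   # [0, 1, 3]
```

Under these assumptions both sequences issue the add in cycle 3, but with the shift folded into mulh, cycle 2 is left open for an unrelated instruction.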

Besides, if the adder is in the IO stage, can't you pack that all into three cycles?

ALU                     IO
issue mulh              -
issue mull              mulh completes
issue add               mull completes
-                       perform addition

Of course, I could be missing something fundamentally obvious here.

Just to be clear, I'm suggesting that mulh be something like
  rC[31:16] := rA * rB[31:16]
  rC[15:0]  := 0
All this requires is a 2:1 mux at the output of the multiplier (possibly retimed into a later stage). I'm not suggesting you reuse your ALU barrel-shifter for this simple, special-purpose shift.
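
To illustrate the suggested mulh semantics, here is a small Python sketch. It reads "rC[31:16] := rA * rB[31:16]" as placing the low 16 bits of the product in the upper half of rC, which is an interpretation rather than something stated explicitly above; the function names are illustrative only.

```python
MASK32 = 0xFFFFFFFF

def mulh(ra, rb):
    """Assumed semantics: rC[31:16] := low 16 bits of rA * rB[31:16]; rC[15:0] := 0."""
    product = ra * (rb >> 16)
    return (product & 0xFFFF) << 16

def mull(ra, rb):
    """Low 32 bits of rA * rB[15:0]."""
    return (ra * (rb & 0xFFFF)) & MASK32

def mul_32x32(ra, rb):
    """Three-instruction soft multiply: mulh, mull, add."""
    return (mulh(ra, rb) + mull(ra, rb)) & MASK32

if __name__ == "__main__":
    a, b = 0xCAFEBABE, 0x0BADF00D
    # The result should equal the low 32 bits of the full product.
    print(mul_32x32(a, b) == (a * b) & MASK32)
```

Only the low 16 bits of the high partial product survive the shift into a 32-bit result, which is why the implicit shift loses nothing compared with a separate shift instruction.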