On 2007-08-12, Timothy Normand Miller wrote:
> So, we have some synthesis results.  The winner is:  The multiplier.
> To make a 32x32 multiplier, four of the 18x18's have to be bolted
> together, and this is what we get:

Thanks for the results.  Based on pure algebra, if we don't report
the upper 32 bits of the 64-bit result, then the synthesizer should be
able to eliminate one of the four multipliers.  But that will probably
not help us enough, so let's look at some cheaper alternatives.
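
To spell out that algebra: write each operand as hi*2^16 + lo.  The
hi*hi partial product is then a multiple of 2^32 and cannot affect the
low 32 bits.  A minimal Python sketch to check this; the helper names
split16 and low32_without_hh are made up for illustration:

```python
import random

MASK32 = 0xFFFFFFFF

def split16(x):
    """Return the (high, low) 16-bit halves of a 32-bit value."""
    return (x >> 16) & 0xFFFF, x & 0xFFFF

def low32_without_hh(a, b):
    """Low 32 bits of a*b, built from only three 16x16 partial products."""
    ah, al = split16(a)
    bh, bl = split16(b)
    # a*b = ah*bh*2^32 + (ah*bl + al*bh)*2^16 + al*bl
    # The ah*bh term is a multiple of 2^32, so it vanishes mod 2^32.
    return (((ah * bl + al * bh) << 16) + al * bl) & MASK32

# Compare against the full product on random 32-bit inputs.
random.seed(0)
for _ in range(1000):
    a = random.getrandbits(32)
    b = random.getrandbits(32)
    assert low32_without_hh(a, b) == (a * b) & MASK32
```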


Let's first consider having a single 16x16->32 multiplier.  That is, we
have the instruction

        mul/ll rX, rY, rZ

which multiplies the low 16 bits of rX and rY, and stores the result in
rZ.  To save two shift instructions in the 32x32->32 software multiply
below, I'll also assume

        mul/lh rX, rY, rZ

which multiplies the low 16 bits of rX with the high 16 bits of rY.
Then we can do a 32x32 multiply in 6 instructions:

mul_32x32_from_16x16:
        mul/lh  r0, r1, r2
        mul/lh  r1, r0, r3
        add     r2, r3, r2
        shift   r2, 16, r2
        mul/ll  r0, r1, r3
        add     r3, r2, r2
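
As a sanity check, here is a small Python model of that sequence.  It
is only a sketch under my reading of the semantics: registers are 32
bits wide and wrap on overflow, the destination is the last operand,
and "shift rX, 16, rZ" is a left shift.  The function names are
invented for the model.

```python
MASK32 = 0xFFFFFFFF

def mul_ll(x, y):
    """Model of mul/ll: low 16 bits of x times low 16 bits of y."""
    return ((x & 0xFFFF) * (y & 0xFFFF)) & MASK32

def mul_lh(x, y):
    """Model of mul/lh: low 16 bits of x times high 16 bits of y."""
    return ((x & 0xFFFF) * ((y >> 16) & 0xFFFF)) & MASK32

def mul_32x32_from_16x16(r0, r1):
    r2 = mul_lh(r0, r1)        # mul/lh  r0, r1, r2
    r3 = mul_lh(r1, r0)        # mul/lh  r1, r0, r3
    r2 = (r2 + r3) & MASK32    # add     r2, r3, r2  (wrap is harmless here)
    r2 = (r2 << 16) & MASK32   # shift   r2, 16, r2  (assumed left shift)
    r3 = mul_ll(r0, r1)        # mul/ll  r0, r1, r3
    r2 = (r3 + r2) & MASK32    # add     r3, r2, r2
    return r2

# The result should equal the low 32 bits of the full product.
assert mul_32x32_from_16x16(0xFFFFFFFF, 0xFFFFFFFF) == 1
assert mul_32x32_from_16x16(12345678, 87654321) == (12345678 * 87654321) & MASK32
```

The carry lost when the two cross products are added does not matter,
because the following 16-bit left shift would push it out of the
register anyway.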

To generalise the /lh modifier, we could allow a 16-bit left shift on the
second operand of any non-immediate instruction.  If we also throw in a
16-bit right shift on the result, then the above code reduces to 5
instructions.


If we want a compromise, we could instead implement a 32x16->32
multiply.  That is, two multipliers in the ALU stage, and an adder in
the IO stage.  If again we incorporate the shifts, we are down to 4
instructions to compute a 32x32->32 product:

mul_32x32_from_32x16:
        mul/h   r0, r1, r3      ; r3 := r0 * r1[31:16]
        mul/l   r0, r1, r2      ; r2 := r0 * r1[15:0]
        shift   r3, 16, r3
        add     r2, r3, r2

Note that register forwarding does not work fully for the mul
instruction in this case, since it's split over two stages.  There is a
1 cycle delay before we can use the result, which means this is the only
way to order the instructions.
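
The arithmetic of the 32x16->32 sequence can be checked with the same
kind of Python model.  Again a sketch, assuming 32-bit wrapping
registers, results truncated to 32 bits, and a left shift; it models
only the values, not the pipeline timing, and the function names are
invented.

```python
MASK32 = 0xFFFFFFFF

def mul_h(x, y):
    """Model of mul/h: x times y[31:16], truncated to 32 bits."""
    return (x * ((y >> 16) & 0xFFFF)) & MASK32

def mul_l(x, y):
    """Model of mul/l: x times y[15:0], truncated to 32 bits."""
    return (x * (y & 0xFFFF)) & MASK32

def mul_32x32_from_32x16(r0, r1):
    r3 = mul_h(r0, r1)         # mul/h   r0, r1, r3
    r2 = mul_l(r0, r1)         # mul/l   r0, r1, r2
    r3 = (r3 << 16) & MASK32   # shift   r3, 16, r3  (assumed left shift)
    r2 = (r2 + r3) & MASK32    # add     r2, r3, r2
    return r2

# The result should equal the low 32 bits of the full product.
assert mul_32x32_from_32x16(0xFFFFFFFF, 0xFFFFFFFF) == 1
assert mul_32x32_from_32x16(12345678, 87654321) == (12345678 * 87654321) & MASK32
```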


My guess is that the 16x16->32 multiplier with shifts on both the second
operand and the result is much cheaper than the extra adder and
multiplier of the 32x16->32 solution, and we save only one instruction
by going to 32x16->32.

Comparing to a full 32x32->64 two-stage multiply...  I don't know how
frequent a full 32 bit multiply will be in time critical sections of
OGA1 firmware.  If there is one full multiply in every 8 instructions,
then we'd need 1.5 times higher frequency to compensate for the 5
instruction software multiply.  Trade-offs, trade-offs, ...
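
For the record, the 1.5 comes from 7 ordinary instructions plus a
5-instruction multiply, i.e. 12 instructions where the hardware version
needs 8.  A small Python sketch of that estimate, assuming the hardware
multiply costs one issue slot and ignoring pipeline stalls; the
function name is made up:

```python
from fractions import Fraction

def required_speedup(window, sw_mul_len, hw_mul_len=1):
    """Clock ratio needed for a software multiply to match hardware.

    window:     instructions per window containing exactly one multiply,
                counted with the hardware multiply as one instruction
    sw_mul_len: length of the software multiply sequence
    hw_mul_len: issue slots taken by the hardware multiply (assumed 1)
    """
    others = window - 1
    return Fraction(others + sw_mul_len, others + hw_mul_len)

print(required_speedup(8, 5))   # 5-instruction software multiply: 3/2
print(required_speedup(8, 6))   # 6-instruction variant: 13/8
print(required_speedup(8, 4))   # 32x16->32 variant: 11/8
```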
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
