On 2007-09-01, Timothy Normand Miller wrote:
> On 9/1/07, Petter Urkedal <[EMAIL PROTECTED]> wrote:
> > On 2007-09-01, Timothy Normand Miller wrote:
> > > I'm not sure we want to add additional MUXing after the REG stage.  It
> > > might be better to move it into the MEM stage.  This is especially not
> > > a problem since we have gobs of time to schedule when the product is
> > > grabbed.
> >
> > My idea is to put it after the register fetches are registered and
> > parallel to the other QOP_ cases in the ALU.  Note that the expensive
> > part of the ALU, the add, is already MUXed separately on the ALU output,
> >
> > assign res_o = res_is_add_r? res_if_add_r : res_r;
> >
> > where res_is_add_r is registered.  So, I don't think adding a multiply
> > case (or actually, replacing the current case with "sink" and "source"
> > cases), will have an effect on timing.  At least in theory, though as
> > pointed out not long ago, the synthesis results are easily disturbed.
> 
> I think you're right that 3-to-1 or 4-to-1 shouldn't be much worse
> than 2-to-1.  We'll see.  (Of course, we cannot inject it _before_ the
> register, because that's built into the memory for the register file.)

I had in mind writing the result first to res_r, i.e. in the above case
statement, thus my insistence that it have no effect on timing.  But if
we can MUX it on the res_o output as you indicate, that's even better as
it means we can use the result one cycle earlier.  It's worth a try.
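
A behavioural sketch of that output MUX, in Python rather than Verilog so
it can be run directly.  Only res_o, res_r, res_is_add_r and res_if_add_r
come from the line quoted above; the product input, its select signal
res_is_mul_r, and the priority of multiply over add are assumptions made
for illustration.

```python
def res_o(res_r, res_if_add_r, res_is_add_r, product=0, res_is_mul_r=False):
    """Select the ALU output from already-registered candidates.

    Mirrors `assign res_o = res_is_add_r ? res_if_add_r : res_r;` with one
    extra, hypothetical input for the multiplier product.
    """
    if res_is_mul_r:      # hypothetical registered select for the product
        return product
    if res_is_add_r:      # existing registered select for the adder
        return res_if_add_r
    return res_r          # every other QOP_ case


# The two existing cases behave as before; the third is the new one.
print(res_o(1, 2, False))                               # selects res_r
print(res_o(1, 2, True))                                # selects res_if_add_r
print(res_o(1, 2, True, product=7, res_is_mul_r=True))  # selects product
```

Whether a 3-to-1 select here really costs nothing is exactly what
synthesis would have to show; the model only pins down the behaviour.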

> > I think the IO approach is nice due to the fact that it completely
> > decouples the operation from the CPU, thus separate people can easily
> > maintain the two pieces of code.  However, that presumes we don't push
> > the result into a fixed register.  But, it seems a bit odd to me if we
> > write the operand to IO-ports 8 and 9 and magically get back the result
> > in r31 without an IO-read; did I misunderstand?
> 
> One extreme is to have an instruction that initiates the multiply
> "directly" and have the result magically appear in r31.  Another
> extreme is to use I/O ports 8 and 9 for the inputs and then later read
> from port 10 for the output.  There are also combinations of those
> things, and we shouldn't avoid one just because it seems weird.
> 
> If we're going to 'override' r31, I think we should make it dependent
> only on whether or not there's a pending product.  That is, if a
> product is not pending, we read the real r31.  If a product is
> pending, reading r31 causes the product to be fetched AND resets the
> override.  (Thus, we have only one opportunity to fetch the product
> before it effectively disappears.)  Writes to r31 would have no effect
> either way.  This is a bit weird, I have to admit, so maybe we should
> consider other ways of going about it.
>
> > Also, it's worth considering that even if I am right about the timing,
> > we save a bit of instruction decoding logic by using fully IO-port based
> > approach.  I think the trade-off is 3-instruction multiply in 36 cycles
> > versus 1-instruction multiply in 34 cycles.  So, it's not a big
> > difference.
> 
> That's what I was thinking, but there's also an argument to be made
> for using some dedicated instructions.  When doing only through I/O,
> we need to execute three instructions (and use up the corresponding
> three additional cycles), two to set up the product, and one to fetch
> the result.  When using a dedicated instruction and a register
> override, we need one extra instruction (one to set up the multiply,
> zero to fetch it, because it can be an operand directly in another
> instruction).
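
The r31 override described in the quote can be modelled as below.  This
is a sketch under assumptions, not the actual design: the class and
method names are hypothetical, the product is assumed to be truncated to
32 bits, and "writes to r31 would have no effect" is read as "no effect
on the override", so writes still reach the real r31.

```python
MASK32 = 0xFFFFFFFF


class RegFileWithOverride:
    """32 registers, with r31 overridden while a product is pending."""

    def __init__(self):
        self.regs = [0] * 32
        self.product = None          # None means no product is pending

    def multiply_done(self, value):
        # Hypothetical hook called when the multiplier finishes.
        self.product = value & MASK32

    def read(self, idx):
        if idx == 31 and self.product is not None:
            # Reading fetches the product AND resets the override,
            # so there is exactly one chance to pick it up.
            value, self.product = self.product, None
            return value
        return self.regs[idx]

    def write(self, idx, value):
        # Assumed: writes go to the real register and leave the
        # pending product untouched.
        self.regs[idx] = value & MASK32


rf = RegFileWithOverride()
rf.write(31, 5)
rf.multiply_done(6 * 7)
print(rf.read(31))   # the pending product
print(rf.read(31))   # the real r31 again
```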

If I can put the current line of argument in parentheses for a moment,
I'd like to consider a more systematic exploitation of register-like
entities in the spirit of the IO-architecture (or whatever it's called).
The main deficit of the latter is the number of moves (3) required for
each operation.  This is related to the redundancy of doing address
calculation and supporting a big address space as if IO-ports were
memory.  What if we replace IO ports with IO "registers", quite
literally?  Let's call them s0, s1, ..., s31 (special registers).  Now,
extend the register number bits in the instruction word from 5 to 6, so
that we can freely select operands and write-back among {r0, ..., r31,
s0, ..., s31}.  This allows us to store computed values directly into
IO-ports, not just the multiplication unit, but also DMA and other
registers.
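
A minimal sketch of how a 6-bit register field could be decoded.  The
encoding is an assumption: the text only says the field grows from 5 to
6 bits, so using the top bit to select the special bank is one plausible
choice, and the function names are hypothetical.

```python
def decode_reg(field):
    """Map a 6-bit operand field to a (bank, number) pair."""
    if not 0 <= field < 64:
        raise ValueError("register field must fit in 6 bits")
    bank = "s" if field & 0x20 else "r"   # assumed: bit 5 selects s0..s31
    return bank, field & 0x1F             # low 5 bits are the old index


def reg_name(field):
    """Human-readable name such as 'r3' or 's0'."""
    bank, number = decode_reg(field)
    return f"{bank}{number}"


print([reg_name(f) for f in (0, 31, 32, 63)])
```

With this layout the existing 5-bit encodings of r0..r31 keep their
meaning, which is why the top bit was chosen for the sketch.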

There are several ways to go from here.  For instance, say s0 and s1
connect us to the multiplier.  Then a multiply is

        move r0, s0     ; first operand
        move r1, s1     ; second operand, triggers multiplication
        ; wait 16-17 cycles
        move s0, r0     ; lower 32 bits of product
        move s1, r1     ; upper 32 bits if we need them

but, it must be noted that each move instruction can be replaced with a
computation, so in the best case scenario, we don't use any instructions
on the multiplication itself.  However, that may be too optimistic, and
to make it easier to be efficient we may allow both operands to be
loaded simultaneously:

        move (r0, r1), (s0, s1) ; we'll use a nicer syntax
        ; wait 16-17 cycles
        move s0, r0             ; lower 32 bits of product
        move s1, r1             ; upper 32 bits of product

Our register file allows two simultaneous reads, so the double source in
the first instruction can be implemented.  Further, the IO-ports are
independently controlled, and we can use separate buses for even and odd
port numbers.  On the other hand, when writing back the result, we
respect that our register file only allows one write at a time.  This is
probably not a big loss, since most of the time, we only need the low 32
bits of the result.
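
The s0/s1 protocol above can be sketched as a small behavioural model.
Several details are assumptions not stated in the text: the latency is
fixed at 17 cycles (the upper end of "16-17"), the multiply is unsigned
32x32 -> 64, reading before the product is ready raises an error rather
than stalling, and all names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
LATENCY = 17   # assumed worst case of the 16-17 cycle wait


class SpecialRegMultiplier:
    """Multiplier reached through special registers s0 and s1."""

    def __init__(self):
        self.a = 0
        self.b = 0
        self.busy = 0        # cycles left until the product is valid
        self.product = 0

    def write(self, s, value):
        if s == 0:
            self.a = value & MASK32          # first operand
        elif s == 1:
            self.b = value & MASK32          # second operand
            self.product = self.a * self.b   # writing s1 triggers the multiply
            self.busy = LATENCY
        else:
            raise ValueError("only s0 and s1 belong to the multiplier")

    def write_pair(self, a, b):
        # Models `move (r0, r1), (s0, s1)`: both operands in one instruction.
        self.write(0, a)
        self.write(1, b)

    def tick(self, cycles=1):
        self.busy = max(0, self.busy - cycles)

    def read(self, s):
        if self.busy:
            raise RuntimeError("product not ready yet")
        if s == 0:
            return self.product & MASK32     # lower 32 bits
        if s == 1:
            return self.product >> 32        # upper 32 bits
        raise ValueError("only s0 and s1 belong to the multiplier")


m = SpecialRegMultiplier()
m.write_pair(0xFFFFFFFF, 2)
m.tick(LATENCY)
print(hex(m.read(0)), hex(m.read(1)))
```

The two reads are separate calls on purpose, matching the single write
port of the register file.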

> > If interrupt handlers are simpler than normal code, we could just decide
> > it will only use a few registers, but otherwise doubling the register
> > file makes sense.  We don't need to change the instruction word.  The
> > nanocontroller will arrange to fill in the upper bit on read and
> > write-back.
> 
> Yeah.  Sort of an "interrupt mode" state bit that is used as the
> higher-order bit of the register index.
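
A sketch of that banking scheme, assuming a 64-entry register memory and
32-bit registers; the class and attribute names are hypothetical.  The
instruction word still carries a 5-bit index, and the mode bit supplies
bit 5 on both read and write-back.

```python
class BankedRegisterFile:
    """Doubled register file with an interrupt-mode bank select."""

    def __init__(self):
        self.mem = [0] * 64           # two banks of 32 registers
        self.interrupt_mode = False   # state bit set by the controller

    def _index(self, reg):
        if not 0 <= reg < 32:
            raise ValueError("instruction word only encodes r0..r31")
        # The mode bit becomes the higher-order bit of the index.
        return (int(self.interrupt_mode) << 5) | reg

    def read(self, reg):
        return self.mem[self._index(reg)]

    def write(self, reg, value):
        self.mem[self._index(reg)] = value & 0xFFFFFFFF


rf = BankedRegisterFile()
rf.write(5, 1)              # normal code uses r5
rf.interrupt_mode = True
rf.write(5, 2)              # the handler's r5 is a different cell
rf.interrupt_mode = False
print(rf.read(5))           # normal code still sees its own value
```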

I thought about the possibility of dual-threading the nanocontroller.  It
would have some advantages like avoiding the delay-slot, and allowing
us to pipeline any arithmetic operation over two stages without effect
on the code.  However, the extra pipelining opportunities would probably
account for less than a factor 2 in speed, so we'd need to utilise the
threads, meaning we can't afford to let one just wait for interrupts.