On 2007-09-03, Timothy Normand Miller wrote:
> On 9/3/07, Petter Urkedal <[EMAIL PROTECTED]> wrote:
> >
> > I had in mind writing the result first to res_r, i.e. in the above case
> > statement, thus my insistence that it have no effect on timing.  But if
> > we can MUX it on the res_o output as you indicate, that's even better as
> > it means we can use the result one cycle earlier.  It's worth a try.
> 
> I like your idea better.  The only problem is that this could collide
> with some other write to a register.  Restricting what instructions
> can go into the slot at this time kinda defeats the purpose.

I am on the verge of confusion here, but this should be doable without
breaking instruction/data synchronisation.  After the product-fetching
instruction appears on the ALU, on the next cycle the product will be
registered on res_r, and thereby on res_o, just as if the computation
was done by the ALU.  The difference is that in your version the
product-fetching instruction can be issued one cycle earlier.

> > If I can put the current line of argument in parentheses for a moment,
> > I'd like to consider a more systematic exploitation of register-like
> > entities in the spirit of the IO-architecture (or whatever it's called).
> > The main deficit of the latter is the number of moves (3) required for
> > each operation.  This is related to the redundancy of doing address
> > calculation and supporting a big address space as if IO-ports were
> > memory.  What if we replace IO ports with IO "registers", quite
> > literally.  Let's call them s0, s1, ..., s31 (special registers).  Now,
> > extend the register number bits in the instruction word from 5 to 6, so
> > that we can freely select operands and write-back among {r0, ..., r31,
> > s0, ..., s31}.  This allows us to store computed values directly into
> > IO-ports, not just the multiplication unit, but also DMA and other
> > registers.
> 
> Well, that's not a bad idea.  It's also worth pondering architectures
> that have 512 local registers, unifying the scratch space with the
> register file.  But that may be too radical.

To avoid using most of the instruction just to encode the register
numbers, we could access the upper address bits as a frame pointer to
allow maybe 4 or 8 levels of subroutine calls.  There must be some
overlap for communication.  Say, q0..q63 refers to the r0..r63 registers
of the surrounding frame.  Now, we're almost talking about a stack
machine, but I think we avoid the inefficiencies by having a big area of
the stack available at any time.  One reservation is that now we have no
real static storage.  So maybe the final arrangement would be something
like
        q0..q31 -- previous frame
        r0..r31 -- current frame
        g0..g31 -- global storage
        s0..s31 -- IO-ports

possibly with 64 of each kind if the instruction word permits.

> If we extend the register numbers to 6 bits, what does that do to
> other bits in the instruction?

We lose two bits on immediates.  We currently have 15 bits.
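
As a back-of-the-envelope check, in Python: the sketch assumes that an
immediate-form instruction carries two register fields and that immediates
are signed.  Both are assumptions for illustration, as are the function
names.

```python
# Immediate budget when register fields grow; assumes two register fields
# per immediate-form instruction (an assumption, not a spec).
def immediate_bits(current_imm_bits, old_reg_bits, new_reg_bits, reg_fields=2):
    """Bits left for the immediate after widening each register field."""
    return current_imm_bits - reg_fields * (new_reg_bits - old_reg_bits)


def signed_range(bits):
    """Smallest and largest value of a two's complement field of this width."""
    return (-(1 << (bits - 1)), (1 << (bits - 1)) - 1)
```

Going from 5 to 6 bits per register field, immediate_bits(15, 5, 6) gives 13,
and a signed 13-bit immediate covers -4096 to 4095 instead of -16384 to 16383.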
 
> > There are several ways to go from here.  For instance, say s0 and s1
> > connects us to the multiplier.  Then a multiply is
> >
> >         move r0, s0     ; first operand
> >         move r1, s1     ; second operand, triggers multiplication
> >         ; wait 16-17 cycles
> >         move s0, r0     ; lower 32 bits of product
> >         move s1, r1     ; upper 32 bits if we need them
> >
> > but, it must be noted that each move instruction can be replaced with a
> > computation, so in the best case scenario, we don't use any instructions
> > on the multiplication itself.  However, that may be too optimistic, and
> > to make it easier to be efficient we may allow both operands to be
> > loaded simultaneously:
> 
> Here's another idea:  Have two register files that work independently.
>  Usually, it doesn't make much difference.  But on occasion, we may
> have a situation where two instructions want to dump results into
> registers at the same time.  As long as they're allocated properly,
> there will be no conflict.  In fact, we can do this in groups of 16
> registers.

That's good to know.  So, assuming even and odd numbered registers are
in separate banks, we'd let the few instructions with double write-back
use fixed pairs (r0, r1), (r2, r3), ..., ignoring the lower bit of the
write-back register number.  Fetching the full 64 bits of the product in
one instruction is probably not a compelling argument to add complexity
to the register stage, but we can keep this in mind if the need arises.
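
For what it's worth, the pairing rule is tiny.  A Python sketch, assuming
that bank selection is simply the lowest bit of the register number (the
function names are mine):

```python
# Fixed write-back pairs for double write-back instructions.
# Assumes even registers live in one bank and odd registers in the other.
def bank(reg):
    """Bank index of a register: 0 for even, 1 for odd."""
    return reg & 1


def writeback_pair(reg):
    """Ignore the lowest bit of the write-back number and return the pair."""
    even = reg & ~1
    return (even, even | 1)
```

Naming either r4 or r5 as the destination selects the pair (r4, r5), and the
two halves always land in different banks, so the writes never collide.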
 
> >         move (r0, r1), (s0, s1) ; we'll use a nicer syntax
> >         ; wait 16-17 cycles
> >         move s0, r0             ; lower 32 bits of product
> >         move s1, r1             ; upper 32 bits of product
> >
> > Our register file allows two simultaneous reads, so the double source in
> > the first instruction can be implemented.  Further, the IO-ports are
> > independently controlled, and we can use separate buses for even and odd
> > port numbers.  On the other hand, when writing back the result, we
> > respect that our register file only allows one write at a time.  This is
> > probably not a big loss, since most of the time, we only need the low 32
> > bits of the result.
> 
> If we're going to have to use separate move instructions to fetch the
> result, what is the advantage over fetching them from I/O space?
> Forwarding?

The advantage I think is that the product can immediately be used in a
new computation, just as with the r31 convention considered previously.
For instance if we're accumulating a sum of products into (r1, r0):

        move (r0, r1), (s0, s1)
        ; wait for result
        add s0, r0, r0
        add s1, r1, r1
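
A toy Python model of that sequence, to check the arithmetic.  It assumes an
unsigned 32x32 multiply and ignores the latency entirely.  It also propagates
the carry from the low word into the high word, which the two plain adds
above do not show; doing that in hardware would take an add-with-carry or
something similar, and that part is my assumption.  The class and function
names are made up for the sketch.

```python
MASK32 = 0xFFFFFFFF


class MulPorts:
    """Toy model of the multiplier behind s0/s1 (latency ignored)."""

    def __init__(self):
        self.lo = 0  # what a read of s0 returns
        self.hi = 0  # what a read of s1 returns

    def write_pair(self, a, b):
        # move (r0, r1), (s0, s1): load both operands and start the multiply
        product = (a & MASK32) * (b & MASK32)  # unsigned multiply assumed
        self.lo = product & MASK32
        self.hi = (product >> 32) & MASK32


def accumulate(ports, acc_lo, acc_hi):
    """Model add s0, r0, r0 ; add s1, r1, r1 with the carry made explicit."""
    total = acc_lo + ports.lo
    new_lo = total & MASK32
    carry = total >> 32  # 0 or 1, assumed to feed the high-word add
    new_hi = (acc_hi + ports.hi + carry) & MASK32
    return new_lo, new_hi
```

For instance, after write_pair(3, 4) an accumulator holding (5, 0) as
(low, high) becomes (17, 0).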