On 9/3/07, Petter Urkedal <[EMAIL PROTECTED]> wrote:
> On 2007-09-03, Timothy Normand Miller wrote:
> > On 9/3/07, Petter Urkedal <[EMAIL PROTECTED]> wrote:
> > >
> > > I had in mind writing the result first to res_r, i.e. in the above case
> > > statement, thus my insistence that it have no effect on timing.  But if
> > > we can MUX it on the res_o output as you indicate, that's even better as
> > > it means we can use the result one cycle earlier.  It's worth a try.
> >
> > I like your idea better.  The only problem is that this could collide
> > with some other write to a register.  Restricting what instructions
> > can go into the slot at this time kinda defeats the purpose.
>
> I am on the verge of confusion here, but this should be doable without
> breaking instruction/data synchronisation.  After the product-fetching
> instruction appears on the ALU, on the next cycle the product will be
> registered on res_r, and thereby on res_o, just as if the computation
> was done by the ALU.  The difference is that in your version the
> product-fetching instruction can be issued one cycle earlier.

I was assuming that the product would be dumped into a register just
like any other ALU result.  And I was saying that that's a problem.

> > > If I can put the current line of argument in parentheses for a moment,
> > > I'd like to consider a more systematic exploitation of register-like
> > > entities in the spirit of the IO-architecture (or whatever it's called).
> > > The main deficit of the latter is the number of moves (3) required for
> > > each operation.  This is related to the redundancy of doing address
> > > calculation and supporting a big address space as if IO-ports were
> > > memory.  What if we replace IO ports with IO "registers", quite
> > > literally.  Let's call them s0, s1, ..., s31 (special registers).  Now,
> > > extend the register number bits in the instruction word from 5 to 6, so
> > > that we can freely select operands and write-back among {r0, ..., r31,
> > > s0, ..., s31}.  This allows us to store computed values directly into
> > > IO-ports, not just the multiplication unit, but also DMA and other
> > > registers.
> >
> > Well, that's not a bad idea.  It's also worth pondering architectures
> > that have 512 local registers, unifying the scratch space with the
> > register file.  But that may be too radical.
>
> To avoid using most of the instruction just to encode the register
> numbers, we could access the upper address bits as a frame pointer to
> allow maybe 4 or 8 levels of subroutine calls.  There must be some
> overlap for communication.  Say, q0..q63 refers to the r0..r63 registers
> of the surrounding frame.  Now, we're almost talking about a stack
> machine, but I think we avoid the inefficiencies by having a big area of
> the stack available at any time.  One reservation is that now we have no
> real static storage.  So maybe the final arrangement would be something
> like
>         q0..q31 -- previous frame
>         r0..r31 -- current frame
>         g0..g31 -- global storage
>         s0..s31 -- IO-ports

This reminds me of SPARC too, and I think that's just too much
complexity for any CPU, let alone a nanocontroller for our purposes.
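
For reference, the quoted q/r/g/s arrangement can be sketched roughly as
below, in Python purely for illustration.  The 7-bit specifier (a 2-bit
class field plus a 5-bit index), the depth of 8 frames, and every name in
the code are assumptions made for this sketch; nothing like this was
settled in the thread.

```python
# Hypothetical model of the q/r/g/s register arrangement quoted above.
NUM_FRAMES = 8        # assumed call depth (the thread says "maybe 4 or 8")
REGS_PER_FRAME = 32   # 32 registers of each kind, as in the quoted list

# Assumed encoding of the class field; not taken from any real ISA.
CLASS_Q, CLASS_R, CLASS_G, CLASS_S = 0, 1, 2, 3


def physical_index(spec, fp):
    """Map a 7-bit register specifier to (storage space, index).

    spec: bits 6..5 select the class, bits 4..0 the register number.
    fp:   current frame pointer, 0 .. NUM_FRAMES - 1.
    """
    if not 0 <= spec < 128:
        raise ValueError("specifier must fit in 7 bits")
    if not 0 <= fp < NUM_FRAMES:
        raise ValueError("frame pointer out of range")
    cls, idx = spec >> 5, spec & 0x1F
    if cls == CLASS_Q:
        # q registers alias the r registers of the surrounding frame.
        # Frame 0 wraps around to the last frame in this simple model.
        return ("window", ((fp - 1) % NUM_FRAMES) * REGS_PER_FRAME + idx)
    if cls == CLASS_R:
        return ("window", fp * REGS_PER_FRAME + idx)
    if cls == CLASS_G:
        return ("global", idx)
    return ("io", idx)  # s registers are IO ports, not stored in the file
```

In this model q3 seen from frame 3 and r3 seen from frame 2 land on the
same physical entry, which is the overlap used for passing values between
caller and callee.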

I think we should keep it simple and just deal with some inefficiency
here and there for the sake of keeping the logic to a minimum.  I
entertain the idea of having separate registers for interrupt mode,
but only because it saves a huge amount of overhead for ISRs.

>
> possible with 64 of each kind if the instruction word permits.
>
> > If we extend the register numbers to 6 bits, what does that do to
> > other bits in the instruction?
>
> We lose two bits on immediates.  We currently have 15 bits.

I'm not sure how important that is, but I think that 15-bit immediates
may turn out to be rather common.  I'm thinking about both GPU packet
decode and VGA.
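
To put numbers on the trade-off, here is a small Python sketch of what a
15-bit versus a 13-bit immediate can represent.  The thread does not say
whether immediates are sign-extended, so both interpretations are shown;
the helper names are made up for this example.

```python
def signed_range(bits):
    """Inclusive (min, max) of a two's-complement field of this width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def unsigned_max(bits):
    """Largest value of a zero-extended field of this width."""
    return (1 << bits) - 1


def fits_signed(value, bits):
    """True if value can be encoded as a sign-extended immediate."""
    lo, hi = signed_range(bits)
    return lo <= value <= hi


if __name__ == "__main__":
    for bits in (15, 13):
        print(bits, signed_range(bits), unsigned_max(bits))
```

With 15 bits the signed range is -16384..16383; with 13 bits it shrinks
to -4096..4095.  A constant such as 8192 would then need a second
instruction to build, if immediates are signed.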

> > Here's another idea:  Have two register files that work independently.
> >  Usually, it doesn't make much difference.  But on occasion, we may
> > have a situation where two instructions want to dump results into
> > registers at the same time.  As long as they're allocated properly,
> > there will be no conflict.  In fact, we can do this in groups of 16
> > registers.
>
> That's good to know.  So, assuming even and odd numbered registers are
> in separate banks, we'd let the few instructions with double write-back
> use fixed pairs (r0, r1), (r2, r3), ..., ignoring the lower bit of the
> write-back register number.  Fetching the full 64 bits of the product in
> one instruction is probably not a compelling argument to add complexity
> to the register stage, but we can keep this in mind if the need arises.

I completely agree.  For now, the main advantage is being able to
retire a mult at the same time as something else, in the case where
the mult writes directly to the reg file.
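
A minimal sketch of the retirement rule this implies, in Python for
illustration.  It assumes the even/odd bank split from the quoted text
and one write port per bank; splitting in groups of 16 would only change
the bank() function.  All names here are hypothetical.

```python
def bank(reg):
    """Bank of a register under the assumed even/odd split."""
    return reg & 1


def can_retire_together(dest_a, dest_b):
    """Two results can be written in the same cycle only if their
    destinations sit in different banks (one write port per bank)."""
    return bank(dest_a) != bank(dest_b)


def pair_for(reg):
    """Fixed (even, odd) pair for a double write-back, ignoring the
    low bit of the write-back register number."""
    base = reg & ~1
    return base, base + 1
```

For example, a mult retiring into r4 could go in the same cycle as an
ALU result for r7, but not one for r6; the register allocator would have
to keep such destinations in opposite banks.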

> > If we're going to have to use separate move instructions to fetch the
> > result, what is the advantage over fetching them from I/O space?
> > Forwarding?
>
> The advantage I think is that the product can immediately be used in a
> new computation, just as with the r31 convention considered previously.
> For instance if we're accumulating a sum of products into (r1, r0):

Ok.  But the more we do this, the more multiplexing is added to the
inputs to the ALU.  I think I'd rather ensure that REG and ALU are
fast.  MEM may also suffer from some delay issues, but it's almost
entirely there to do memory access; lacking other things to do, we
should be able to make it fast enough.

-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
