On 2007-09-03, Timothy Normand Miller wrote:
> On 9/3/07, Petter Urkedal <[EMAIL PROTECTED]> wrote:
> >
> > I had in mind writing the result first to res_r, i.e. in the above case
> > statement, thus my insistence that it have no effect on timing. But if
> > we can MUX it on the res_o output as you indicate, that's even better as
> > it means we can use the result one cycle earlier. It's worth a try.
>
> I like your idea better. The only problem is that this could collide
> with some other write to a register. Restricting what instructions
> can go into the slot at this time kinda defeats the purpose.
I am on the verge of confusion here, but this should be doable without
breaking instruction/data synchronisation. After the product-fetching
instruction appears on the ALU, on the next cycle the product will be
registered on res_r, and thereby on res_o, just as if the computation
was done by the ALU. The difference is that in your version the
product-fetching instruction can be issued one cycle earlier.
> > If I can put the current line of argument in parentheses for a moment,
> > I'd like to consider a more systematic exploitation of register-like
> > entities in the spirit of the IO-architecture (or whatever it's called).
> > The main deficit of the latter is the number of moves (3) required for
> > each operation. This is related to the redundancy of doing address
> > calculation and supporting a big address space as if IO-ports were
> > memory. What if we replace IO ports with IO "registers", quite
> > literally. Let's call them s0, s1, ..., s31 (special registers). Now,
> > extend the register number bits in the instruction word from 5 to 6, so
> > that we can freely select operands and write-back among {r0, ..., r31,
> > s0, ..., s31}. This allows us to store computed values directly into
> > IO-ports, not just the multiplication unit, but also DMA and other
> > registers.
>
> Well, that's not a bad idea. It's also worth pondering architectures
> that have 512 local registers, unifying the scratch space with the
> register file. But that may be too radical.
To avoid using most of the instruction just to encode the register
numbers, we could access the upper address bits as a frame pointer to
allow maybe 4 or 8 levels of subroutine calls. There must be some
overlap for communication. Say, q0..q63 refers to the r0..r63 registers
of the surrounding frame. Now, we're almost talking about a stack
machine, but I think we avoid the inefficiencies by having a big area of
the stack available at any time. One reservation is that now we have no
real static storage. So maybe the final arrangement would be something
like
q0..q31 -- previous frame
r0..r31 -- current frame
g0..g31 -- global storage
s0..s31 -- IO-ports
possibly with 64 of each kind if the instruction word permits.
> If we extend the register numbers to 6 bits, what does that do to
> other bits in the instruction?
We lose two bits on immediates. We currently have 15 bits.
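
The arithmetic behind that, as a small Python sketch. It assumes a fixed instruction word in which an immediate-format instruction carries two register fields (one source, one write-back); that field count is my assumption, chosen because it matches the two-bit loss.

```python
def immediate_bits(current_bits, reg_fields, old_width=5, new_width=6):
    """Bits left for the immediate after widening every register field."""
    return current_bits - reg_fields * (new_width - old_width)


def signed_range(bits):
    """Inclusive range of a two's-complement immediate of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


remaining = immediate_bits(15, reg_fields=2)
print(remaining, signed_range(remaining))
```

So a signed immediate would shrink from 15 to 13 bits under these assumptions, and an instruction with three register fields would give up three bits instead.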
> > There are several ways to go from here. For instance, say s0 and s1
> > connect us to the multiplier. Then a multiply is
> >
> > move r0, s0 ; first operand
> > move r1, s1 ; second operand, triggers multiplication
> > ; wait 16-17 cycles
> > move s0, r0 ; lower 32 bits of product
> > move s1, r1 ; upper 32 bits if we need them
> >
> > but, it must be noted that each move instruction can be replaced with a
> > computation, so in the best case scenario, we don't use any instructions
> > on the multiplication itself. However, that may be too optimistic, and
> > to make it easier to be efficient we may allow both operands to be
> > loaded simultaneously:
>
> Here's another idea: Have two register files that work independently.
> Usually, it doesn't make much difference. But on occasion, we may
> have a situation where two instructions want to dump results into
> registers at the same time. As long as they're allocated properly,
> there will be no conflict. In fact, we can do this in groups of 16
> registers.
That's good to know. So, assuming even and odd numbered registers are
in separate banks, we'd let the few instructions with double write-back
use fixed pairs (r0, r1), (r2, r3), ..., ignoring the lower bit of the
write-back register number. Fetching the full 64 bits of the product in
one instruction is probably not a compelling argument to add complexity
to the register stage, but we can keep this in mind if the need arises.
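
A minimal Python sketch of the paired write-back described above, assuming 32 registers split into an even and an odd bank. The class and method names are made up for illustration and do not correspond to any existing code.

```python
MASK32 = 0xFFFFFFFF


class BankedRegisterFile:
    """Even-numbered registers in bank 0, odd-numbered in bank 1."""

    def __init__(self, num_regs=32):
        self.banks = [[0] * (num_regs // 2), [0] * (num_regs // 2)]

    def read(self, reg):
        return self.banks[reg & 1][reg >> 1]

    def write(self, reg, value):
        # Ordinary single write-back: the low bit selects the bank.
        self.banks[reg & 1][reg >> 1] = value & MASK32

    def write_pair(self, reg, even_value, odd_value):
        # Double write-back: the low bit of reg is ignored, so the pair
        # is always (r2k, r2k+1) and each bank sees exactly one write.
        row = reg >> 1
        self.banks[0][row] = even_value & MASK32
        self.banks[1][row] = odd_value & MASK32
```

Naming either r4 or r5 as the write-back register of a double write-back instruction then lands the two values in (r4, r5).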
> > move (r0, r1), (s0, s1) ; we'll use a nicer syntax
> > ; wait 16-17 cycles
> > move s0, r0 ; lower 32 bits of product
> > move s1, r1 ; upper 32 bits of product
> >
> > Our register file allows two simultaneous reads, so the double source in
> > the first instruction can be implemented. Further, the IO-ports are
> > independently controlled, and we can use separate buses for even and odd
> > port numbers. On the other hand, when writing back the result, we
> > respect that our register file only allows one write at a time. This is
> > probably not a big loss, since most of the time, we only need the low 32
> > bits of the result.
>
> If we're going to have to use separate move instructions to fetch the
> result, what is the advantage over fetching them from I/O space?
> Forwarding?
The advantage I think is that the product can immediately be used in a
new computation, just as with the r31 convention considered previously.
For instance if we're accumulating a sum of products into (r1, r0):
move (r0, r1), (s0, s1)
; wait for result
add s0, r0, r0
add s1, r1, r1
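
For reference, a small Python model of the sequence above. It is only a sketch under assumptions of my own: an unsigned 32x32 multiply, a fixed 17-cycle latency, and a carry from the low word into the high word (which the second add would need some form of add-with-carry to get). The names MultiplierPorts and accumulate are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MUL_LATENCY = 17  # assumed; the thread says 16-17 cycles


class MultiplierPorts:
    """Models s0/s1 as the operand and result ports of the multiplier."""

    def __init__(self):
        self.product = 0
        self.busy = 0

    def write_operands(self, a, b):
        # Models `move (ra, rb), (s0, s1)`: the write starts a multiply.
        self.product = (a & MASK32) * (b & MASK32)
        self.busy = MUL_LATENCY

    def tick(self, cycles=1):
        self.busy = max(0, self.busy - cycles)

    def read(self, port):
        # port 0 is s0 (low 32 bits), port 1 is s1 (high 32 bits).
        if self.busy:
            raise RuntimeError("product not ready yet")
        if port == 0:
            return self.product & MASK32
        return (self.product >> 32) & MASK32


def accumulate(pairs):
    """Sum of products as a (high, low) pair of 32-bit words."""
    mul = MultiplierPorts()
    lo = hi = 0
    for a, b in pairs:
        mul.write_operands(a, b)
        mul.tick(MUL_LATENCY)              # "; wait for result"
        total = lo + mul.read(0)           # add s0 into the low word
        lo = total & MASK32
        carry = total >> 32
        hi = (hi + mul.read(1) + carry) & MASK32  # add s1 plus carry
    return hi, lo
```

In the model the product is consumed straight from the ports by the adds, without first being moved into a general register, which is the point being made above.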