On 3/18/07, Petter Urkedal <[EMAIL PROTECTED]> wrote:
On 2007-03-18, Timothy Normand Miller wrote:
> We typically see the pipeline as: logic - register - logic - register, etc.
>
> Well, sometimes, the register is built into the logic in a way that
> won't let us insert extra logic. In particular, the block RAMs are
> synchronous. Read data is available one cycle after the address is
> asserted. That means we can't insert logic between the "address MUX"
> and the "pipeline register".
>
> We have a similar problem with the distributed RAM we'd use for the
> 32-entry register file. In this case, we CAN do asynchronous reads
> from the RAM, but it's inefficient. Moreover, the address mux is
> heavy-weight enough that we may want to limit the pipeline stage to
> just that RAM lookup.
So, the first two stages are fully saturated by two lookups. I see.
Yes, unfortunately. Also, we make a convention out of registering
outputs (consistent with the way that these RAMs work), so we have to
figure out how to do all our math from inputs inward.
I wouldn't worry about the instruction set, we can always have the first
assembler use a "sensible" subset, and exploit exotic combinations by
hand-coding critical loops in the last stage of development. (A smarter
assembler would be able to merge instructions.)
Perhaps, but also consider how special-purpose this is going to be.
The spirit of this design will continue forward. We'll always need
something like this (unless we design a shader and then decide to just
use another one in this role), but the specifics of the design will
evolve in ways that are not backward or forward compatible. And that
doesn't matter, since it'll always be hidden behind a driver from the
user's perspective.
If I understand things right, the 5-stage design already exhibits some
exotic behaviour in register dependency: Even if we short-wire outputs
from the ALU when the output register of one instruction is the input
register to the next, there is still one stage before it is written to
the register file.
True. What we'll do is something like this: The inputs to the ALU
(the ALU logic, not the ALU stage) will come through a MUX. One of
the sources is the register file. The other sources are the
registered outputs of the ALU stage or the registered outputs of the
MEM stage. We'll need some clever sort of table that tracks which
registers are where. Each operand to the ALU needs to be compared
against only a few register numbers, so it's not too bad. The ALU may
forward more numbers, but it has only one result. Same for the MEM
stage. So we compare each input register to the ALU against two
register numbers. If we can do it one stage in advance (indeed, I
believe we can), even better.
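A minimal Python sketch of the forwarding MUX described above (names like `alu_dest` are illustrative, not from the actual Verilog):

```python
def select_operand(reg_num, regfile, alu_dest, alu_result, mem_dest, mem_result):
    """Pick the freshest value for one ALU input operand.

    Each input register number is compared against only two in-flight
    destination registers: the one in the ALU stage and the one in the
    MEM stage.  The younger (ALU-stage) result wins if both match.
    A dest of None means that stage holds no pending result.
    """
    if alu_dest is not None and reg_num == alu_dest:
        return alu_result          # forward straight from the ALU stage
    if mem_dest is not None and reg_num == mem_dest:
        return mem_result          # forward from the MEM stage
    return regfile[reg_num]        # no hazard: read the register file
```

In hardware this is just two equality comparators and a 3-way MUX per operand, which is why the comparison "against only a few register numbers" stays cheap.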
Or do you refer to the semantic asymmetry between FETCH and STORE?
This.
> For the MEM/IO stage, there are four operations: Do nothing (forward
> to next stage), perform write of B to address from A, read from
> address A, or cause shunt from address in A to address in B. That's a
> few bits for an opcode.
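The four MEM/IO operations quoted above can be sketched as a small dispatch (a toy model, not the actual stage logic; the dict stands in for the block RAM):

```python
from enum import Enum

class MemOp(Enum):
    NOP = 0    # do nothing: forward the value to the next stage
    WRITE = 1  # mem[A] = B
    READ = 2   # result = mem[A]
    SHUNT = 3  # mem[B] = mem[A], a move within the RAM in one instruction

def mem_stage(op, a, b, mem):
    """One cycle of the MEM/IO stage; two opcode bits select the action."""
    if op is MemOp.WRITE:
        mem[a] = b
        return None
    if op is MemOp.READ:
        return mem[a]
    if op is MemOp.SHUNT:
        mem[b] = mem[a]
        return None
    return a  # NOP: pass through (assuming A carries the forwarded value)
```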
So it's possible to move a value from one address to another in the same
instruction within the same block RAM? Do we need it?
The objective is to be able to move data at 66.6 million words per
second, and I would be amazed if our processor could exceed 100MHz.
That means that if we required one IN and one OUT for each word moved
from the master state machine's FIFO into the memory write FIFO, we
couldn't keep up.
In fact, that makes me realize something I don't have a solution for.
When writing, we need to push two words: data and its address. If
that takes two cycles, we're already screwed.
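The arithmetic behind that worry, spelled out:

```python
clock_hz = 100e6     # optimistic upper bound on the core clock
target = 66.6e6      # words per second the move traffic must sustain

# A write needs two pushes: one for the data, one for its address.
cycles_per_word = 2
throughput = clock_hz / cycles_per_word   # 50 million words/s

print(throughput < target)  # -> True: two cycles per word can't keep up
```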
I think someone suggested the idea of setting up background moves.
That is, we don't move the words ourselves; we tell some other logic
to move some number of words. This also frees us from having to deal
with pipeline stalls, because we can just request to move N words
(perhaps more than are available at the time), let it take an unknown
amount of time, and pick up the result later by checking status.
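A toy model of that background-move idea (class and method names are made up for illustration; `tick` stands in for one clock of the mover logic):

```python
class BackgroundMover:
    """Request-and-poll mover: the CPU asks for N words and checks
    status later, instead of copying word by word itself."""

    def __init__(self, src_fifo, dst_fifo):
        self.src = src_fifo
        self.dst = dst_fifo
        self.remaining = 0

    def request(self, n_words):
        # May legitimately exceed what's currently in the source FIFO.
        self.remaining += n_words

    def tick(self):
        """One clock: move at most one word, if any are queued and ready."""
        if self.remaining and self.src:
            self.dst.append(self.src.pop(0))
            self.remaining -= 1

    def busy(self):
        return self.remaining > 0   # the CPU polls this instead of stalling
```

Note that requesting more words than are available doesn't stall anything; the mover simply stays busy until the rest of the data arrives.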
We're going to end up with an intricate network of queues. Read
request queues, read return queues, write queues, and queues into
which we put requests to do reads and writes. :)
> MIPS reserves a reg zero as a "bit bucket". Writes are thrown away,
> and reads always return zero. We can use this implicitly for some
> instructions.
Great, then we just mandate iz = 0 for STORE.
Basically, yeah.
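The bit-bucket convention in runnable form (a sketch of the 32-entry register file semantics, modeled after MIPS):

```python
class RegisterFile:
    """32-entry register file with register 0 as a bit bucket:
    writes to r0 are discarded and reads of r0 always return zero."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, i):
        return 0 if i == 0 else self.regs[i]

    def write(self, i, value):
        if i != 0:              # writes to r0 are silently dropped
            self.regs[i] = value
```

With this, mandating iz = 0 for STORE costs nothing: the write port simply discards the result.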
Sorry,

x =
    if (sx == 1) then
        return ~r[ix] + mx
    else
        return r[ix] + mx

y =
    if (iy != 0 && qmode != STORE) {
        if (sy == 1)
            return r[iy] >> c   /* presumably >>; sy selects shift direction */
        else
            return r[iy] << c
    } else
        return c
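Assuming the sy == 1 branch was meant to be a right shift (the formulas as written give the same result in both branches), a runnable version of those operand selectors, with hypothetical 32-bit masking added:

```python
MASK = 0xFFFFFFFF  # assume 32-bit registers

def operand_x(r, ix, sx, mx):
    """x operand: sx selects the bitwise complement of r[ix] before adding mx."""
    if sx == 1:
        return (~r[ix] + mx) & MASK
    return (r[ix] + mx) & MASK

def operand_y(r, iy, sy, c, qmode):
    """y operand: c alone when iy is 0 or qmode is STORE, else r[iy]
    shifted by c, with sy (assumed) selecting the shift direction."""
    if iy != 0 and qmode != "STORE":
        if sy == 1:
            return (r[iy] >> c) & MASK
        return (r[iy] << c) & MASK
    return c
```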
With the loss of the stage 2 math, what do these become now?
addsub sounds fine. (I was thinking the alternative was to have two
separate (and expensive) units for this.) I understand we are making
LEGO towers here.
I like the metaphor. :)
> Note that any ASIC technology we select later is likely to be similar
> enough that we might as well just do it this way.
I presume we don't want to change too much, since that would increase
the risk of introducing bugs. Is it fair to argue that we have
significantly more space on the ASIC than on the FPGA, so whatever fits
in the latter fits in the former? That is, we'd mostly need to do things
like replacing standard units of one technology with those of the other?
Yes, even if it's not optimal for the ASIC, we want to use our Verilog
code as unmodified as possible. In any case, it'll be faster for the
ASIC.
--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Favorite book: The Design of Everyday Things, Donald A. Norman, ISBN
0-465-06710-7
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)