Re: [Open-graphics] Designing a CPU

Timothy Normand Miller Sun, 18 Mar 2007 11:32:39 -0800

On 3/18/07, Petter Urkedal <[EMAIL PROTECTED]> wrote:

I'd like to lift some ideas for the processing pipeline.  I found it
easiest to sketch out the whole pipeline based on Tim's post and earier
model http://lists.duskglow.com/open-graphics/2006-August/006968.html
and incorporate the ideas there.


To summarise the pipeline stages:

    1. Instruction fetch
    2. Register fetch and branch condition
    3. ALU
    4. Memory IO
    5. Write-back

The ideas are

  * Move cheap unary operators from stage 3 to stage 2, e.g. istead of
    implementing {add, sub} in 3, implement only {add} and optional
    negation on one of the operands.  The motivation is that parallel
    functions in the ALU sits idle while one is working, and by moving
    some computation to the previous stage we can instead exploit more
    combinations.


I like this idea.  I'm not sure how well it'll work for an FPGA.

We typically see the pipeline as: logic - register - logic - register, etc.

Well, sometimes, the register is built into the logic in a way that
won't let us insert extra logic.  In particular, the block RAMs are
synchronous.  Read data is available one cycle after the address is
asserted.  That means we can't insert logic between the "address MUX"
and the "pipeline register".

We have a similar problem with the distributed RAM we'd use for the
32-entry register file.  In this case, we CAN do asynchronous reads
from the RAM, but it's inefficient.  Moreover, the address mux is
heavy-weight enough that we may want to limit the pipeline stage to
just that RAM lookup.

Don't get me wrong.  I really like the idea of doing the unary ops in
one stage and then the binary ops in the next.  This saves us from
having to have separate add and sub logic.  (Although, in FPGAs, we
have something called an ADDSUB that combines the two efficiently.)
It's a nice idea to put the NOT operation there too, but we don't
generally have opcodes for things like A&~B, and "not" operations are
generally "free".

  * Let the ALU run independent of whether we do arithemetic or memory
    operations.  In case of memory operations, the output of the ALU is
    the address, thus giving us more powerful addressing "for free".
    For store, the second operand is the stored value so we need to
    weaken the addressing by forcing immediate mode for the second
    argument to the ALU.


MIPS does it this way.  Addressing is "reg_val + immediate_constant".
That's why they put the MEM stage right after the ALU.  I've thought
about the fact that we can support a "reg + reg" mode for reads, but
I'm bothered by the lack of orthogonality.  If we have the opcode
"space", we can consider adding it.  How useful would it be?

  * Only basic one-cycle operations goes in the CPU.  If we need any
    multi-cycle operation (div?), we could add it as IO ports, using
    the "modularised RISC" idea,
    http://lists.duskglow.com/open-graphics/2006-April/005307.html.


Oh, yeah!  Good idea.

Below is the sketch.  I hope you can read the Unicode.  Some
conventions:

  * I have taken the liberty to misuse logical operators as bitwise
    logical operators.

  * Conditional bitwise not:
        (¬)^s x = | x     if s = 0
                  | ¬ x   if s = 1


I get the unicode, but I don't know what you're meaning here.  Can you
rewrite to look like C code?


 1. Instruction Fetch

    Fetch one instruction of the form (qmode, qop, ix, sx, mx, iy, sy, iz, c)
    where

        qmode   2 bits  0 = ARITH, 1 = FETCH, 2 = STORE, 3 = BRANCH
        qop     3 bits  ALU function if qmode ≠ BRANCH, else branch condition
        ix      5 bits  first operand register index
        sx      1 bit   first operand bitwise not option
        mx      1 bit   first operand increment by 1 option
        iy      5 bits  second operand register index
        sy      1 bit   second operand bitwise not option
        iz      5 bits  write-back register
        c       9 bits  immediate value


Have a look at how MIPS breaks it up.  I'm not saying that we should
do it that way, but they put a huge amount of thought into the
instruction format that they ended up with.

Also, here's something to consider:  Each stage is going to do something.

In the REG stage the 'opcode' is the two register numbers.  The
positions of the fields in the instruction are fixed, and we always
look them up, even if we end up throwing away the result.  There is
also an opcode for branch.  Either we'll feed a register value or
something else conditionally or not back into the FETCH stage.

In the ALU stage, we're going to combine A (a register value) with B
(either a reg value or an immediate), and we'll apply some math or
logic op.  So there are some bits there for the ALU op.

For the MEM/IO stage, there are four operations:  Do nothing (forward
to next stage), perform write of B to address from A, read from
address A, or cause shunt from address in A to address in B.  That's a
few bits for an opcode.

Finally, the WB stage... there's a matter of deciding whether or not
something from the previous stage will be written back to a register.
So, we need to encode a register number, and we also need to know
whether or not to write back.

MIPS reserves a reg zero as a "bit bucket".  Writes are thrown away,
and reads always return zero.  We can use this implicitly for some
instructions.

Some of these opcodes we'll want to encode explicitly -- a field is
reserved in the instruction.  For others, we'll encode them implicitly
-- some other opcode is "decoded" somewhere to indicate the desired
operation.

 2. Register Fetch, Operand Computation, and Branch Condition

  2a. Operand Computation.

    Here we compute

        x := (¬)^sx r_ix + mx

        y := | (¬)^sy ⌊2^c r_iy⌋  if iy ≠ 0 and qmode ≠ STORE
             | c                  if iy = 0  or qmode = STORE


Can we get English or C translations of these?

    If we expand the first on in terms of (sx, mx) we get the 4
    operations:

        x := | r_ix       if sx = 0, mx = 0
             | ¬ r_ix     if sx = 1, mx = 0
             | r_ix + 1   if sx = 0, mx = 1
             | - r_ix     if sx = 1, mx = 1


This, I understand.

    I assume splitting up negation into bitwise-not and increment does
    not add significantly to the complexity.  If that is not the case,
    we should drop the mx from the instruction and simplify x to

        x := (-1)^sx r_ix


I think about the best we can do is combine add and sub into an addsub
and make and, or, xor, and not sorta share some logic.  Then there's
shifting, comparisons, and a number of other things.  In an FPGA, you
get better results by not trying to be too efficient with your logic,
while for an ASIC, you can get greater benefits from this sort of
thing.  It depends.

    Register r_0 is hardcoded to 0 for reads.

  2b. Branch Condition.

        do_branch := | false    if qmode ≠ BRANCH
                     | x = 0    if qmode = BRANCH and qop_0 = 0
                     | x ≥ 0    if qmode = BRANCH and qop_0 = 1

    The x ≥ 0 test is cheap (check upper bit).  If the x = 0 test is too
    expensive here, then we could add an equality comparison to the ALU
    which returns -1 for true and 0 for false.


BTW, the BRANCH logic would actually reside in the FETCH stage.

There would be a selection between two possible addresses, PC+1
(already computed in a register), and branch address (fed from REG).

The selection is based on the opcode in the appropriately predeeding
instruction (I think it's two instructions back), also fed through
from REG, and conditions computed from the reg value on the registered
output of the register file.

 3. ALU

        z := | x + y    if qop = ADDSUB
             | x ⋅ y    if qop = MULT
             | x ∧ y    if qop = AND
             | x ∨ y    if qop = OR
             | x ⊻ y    if qop = XOR
             | ⌊2^y x⌋  if qop = SHIFT
             | ...

    Note that the choice of x and y in SHIFT gives the maximum flexibility
    by not duplicating the stage 2 shift operation.

 4. Memory and IO

    Optionally fetch from or store to local memory location z:

        z' := z                 if qmode = ARITH or BRANCH
        z' := M_z               if qmode = FETCH
        M_z := r_iy             if qmode = STORE


Yes.

 5. Write-Back

        r_iz := z'  if iz ≠ 0 and qmode ≠ 11

    We could add a bitwise-not here, and drop OR from the ALU, but I'm
    running out of instruction bits.


You're on the right track here.  Alas, some realities about FPGAs keep
us from doing some things too cleverly.

Note that any ASIC technology we select later is likely to be similar
enough that we might as well just do it this way.

--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Favorite book:  The Design of Everyday Things, Donald A. Norman, ISBN
0-465-06710-7

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)

Re: [Open-graphics] Designing a CPU

Reply via email to