On 3/28/07, Petter Urkedal <[EMAIL PROTECTED]> wrote:
[snip]
> I'll go on with an overview of the current CPU. It should be
> accompanied by the Verilog source, and will hopefully help in
> understanding it. We can go into more detail in follow-up posts.
Beautiful summary of, well, everything important about this design. :)
> `define QOP_SHL 3 // Left shift for rY > 0, right shift for rY < 0
Typically we have separate instructions for left-shift,
right-shift-logical (unsigned), and right-shift-arithmetic (signed).
You COULD get the direction out of the sign bit of the shift amount,
but compilers aren't used to this, and depending on how many unused
opcodes you have, it may take less logic to provide separate shift
operations than to examine an extra signal to determine the
direction. Besides, you can't distinguish signed from unsigned right
shifts this way.
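For what it's worth, a three-opcode shifter is cheap to express. Here's a rough Verilog sketch; the module and signal names are mine, and the opcode encoding is purely illustrative, not from the design:

```verilog
// Sketch only: a shifter with separate opcodes, so direction and
// signedness come from the opcode rather than the sign of the amount.
module my_shifter(
    input      [31:0] a,
    input      [4:0]  amt,
    input      [1:0]  op,    // 00 = shl, 01 = shr logical, 10 = shr arith
    output reg [31:0] y);
    always @* begin
        case (op)
            2'b00:   y = a << amt;              // lsh
            2'b01:   y = a >> amt;              // lshr (unsigned)
            2'b10:   y = $signed(a) >>> amt;    // ashr (signed)
            default: y = a;
        endcase
    end
endmodule
```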
> `define QOP_ADD 4
> `define QOP_RSUB 5 // Reversed args for maximum flexibility.
Since any register can be specified as either operand, there's almost
no benefit in having a way to subtract with the operand order
reversed... just reverse them in the code. (Except for one case
involving an immediate.) We do need separate ADD and SUB opcodes.
Also, making one an even number and the other odd helps, because we
can route the low bit of the opcode into the addsub to set its
direction.
The way we get an addsub inferred is to put it into a module by itself
so that the heuristics in the synthesizer will recognize it correctly.
Basically what we want is this:
module my_addsub(
    input  [31:0] A,
    input  [31:0] B,
    output [31:0] C,
    input         sub);
    assign C = sub ? (A - B) : (A + B);
endmodule
In my experience, both the Lattice and Xilinx synthesizers have done
the right thing with this.
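So if ADD and SUB end up as an even/odd opcode pair, the instantiation could be as simple as this sketch (the surrounding signal names are mine, not the design's):

```verilog
// If ADD and SUB differ only in the opcode's low bit, that bit can
// drive the addsub's direction input directly -- no decode needed.
my_addsub u_addsub(
    .A  (rx_val),      // first ALU operand
    .B  (ry_val),      // second ALU operand
    .C  (alu_out),
    .sub(opcode[0]));  // even opcode = add, odd = subtract
```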
> `define QOP_MULT 6
> If we combine these with the QMODE_ARITH mode, then we get the instructions
> and rX, rcY, rZ ; rZ := rX & rcY
> or rX, rcY, rZ
> xor rX, rcY, rZ
> lsh rX, rcY, rZ ; rZ := if rcY < 0 then rX >> -rcY else rX << rcY
> add rX, rcY, rZ ; rZ := rX + rcY
> rsub rX, rcY, rZ ; rZ := rcY - rX
> mult rX, rcY, rZ ; rZ := rX * rcY ; Note: always signed!
Now that you mention that, perhaps we'd like to have signed and
unsigned multiply instructions.
The built-in multipliers are 18*18 -> 36. I think you need four of
them to make 36*36 -> 72. They're signed at 36 bits, so to choose
signed or unsigned, we just decide whether to replicate the sign bit
of each operand out to the full word length or to pad with zeros.
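A rough sketch of that trick: extend each operand by one bit so a single signed multiplier covers both cases (module and signal names are mine):

```verilog
// Sketch: one multiplier for both signed and unsigned 32x32 -> 64.
// Extend each operand by one bit -- replicate the sign bit for a
// signed multiply, prepend a zero for an unsigned one -- then let
// the synthesizer build a signed 33x33 multiply.
module my_mul32(
    input  [31:0] a,
    input  [31:0] b,
    input         sgn,    // 1 = signed multiply, 0 = unsigned
    output [63:0] p);
    wire signed [32:0] ax = {sgn & a[31], a};
    wire signed [32:0] bx = {sgn & b[31], b};
    wire signed [65:0] px = ax * bx;
    assign p = px[63:0];  // full 64-bit product
endmodule
```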
The other thing to be figured out is how to deal with the upper 32
bits. If you multiply two 32-bit numbers, you get a 64-bit product.
I think MIPS used a special register pair (HI/LO) to hold that, so
you could fetch it later. But the multipliers we have access to are
possibly fast enough that we can do the whole thing in one cycle.
If we can do the whole multiply in one cycle, then perhaps one
solution is to have separate multiply instructions that emit the upper
and lower parts of the product. That is, to get both upper and lower
halves, the multiply would be done twice, but we don't care, because
no other approach would be any more efficient.
If the multiply takes, say, two cycles, we'll have to pipeline it.
One approach would be to make the first stage parallel the ALU and the
second stage parallel the MEM stage, with the desired result
multiplexed after the MEM's output register, into WB. Still then, to
get the upper half, we might as well just do the multiply twice.
There are probably other equally good solutions.
> Memory operations are handled by Stage 4, which therefore has access
> to the ALU result of Stage 3. We use the ALU result as the address of
> memory operations. Thus, for each arithmetic instruction, there is one
> fetch (QMODE_FETCH) and one store (QMODE_STORE) instruction. I'll use
> "add" as an example, and its corresponding fetch variant we'll denote
> by
>
> add rX, rcY fetch rZ
MIPS does exactly this. But I like how you split it out, having the
ALU op in a fixed field and the memory op in another fixed field.
Very RISC-like thinking. Hypothetically, we could just as well use
other ALU ops to generate an address, whereas I have the impression
that the MIPS processor hard-codes the ADD operation in the ALU for
address generation. We'll see if this becomes useful... we should
present it to the programmer as the available set of addressing modes.
> add rX, cY store rZ
>
> Again the instruction format is the same as for "add", except that
> QMODE_BITS have the value QMODE_STORE. The meaning is that rZ is stored
> to address rX + cY. There are, however, some things to note about this.
> First, there are 3 inputs here, but the register-fetch can only fetch 2
> registers. Therefore, the store instruction must always be immediate
> (cY), so the address is only computed from one register. Second, this
> is the only instruction where rZ is *input*, and write-back is disabled.
> You can find this logic in Stages 2 and 5, in particular the
> computation of iyz and yz_o.
This one is frustrating to me. We really don't want to have to
multiplex between Y and Z when looking up the register value, just for
this one class of instructions. We'd prefer to always read X and Y
and always write Z. But you have the conflicting requirement of
wanting to use only an immediate value for the offset.
Could we perhaps restrict ourselves to using bits 10:0 of the
immediate as the offset in this case? This way, we read X and Y, add
the offset to whichever is the address, and then write the data.
The thing is, this is probably just pushing complexity off from one
stage onto another.
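For concreteness, here's roughly what that address path could look like, assuming a signed offset sits in bits 10:0 of the instruction word (the field position and signal names are my guesses, not the design's):

```verilog
// Sketch: store-address computation from one register plus a short
// signed immediate, leaving both register read ports free so the
// instruction can read X and Y and still carry the store data.
wire [31:0] insn;                                   // current instruction word
wire [31:0] rx_val;                                 // base register value
wire [10:0] off  = insn[10:0];                      // assumed 11-bit offset field
wire [31:0] addr = rx_val + {{21{off[10]}}, off};   // sign-extend, then add
```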
> === Branching ===
>
> Finally, branches are different from the above, since they don't use the
> ALU. The ALU is skipped here partly because it's very critical to load
> the target address back into stage 1 (instruction fetch) as soon as
> possible. That also means the ALU bits in the instruction word can be
> used to specify the condition of the branch. The relevant bits of the
> instruction word are
However, I presume that the return address is forwarded through the
ALU to WB so that it appears in the register file. I guess we'll want
to generate an instruction that splices in the return address and adds
it to zero and is also a no-op for MEM. We need to diagram this out
so that we know when the return address is actually available to be
used.
> The write-back register rZ here gets the next program counter. For
> internal calls, a bit-bucket register can be passed (I suggest r31).
> For subroutine calls, rZ will be the register where we pass the
> continuation PC (return address).
Usually, the bit-bucket is r0 for MIPS. Why not do the same?
BTW, I was thinking about how to make r0 always read zero in the FPGA.
We'd really rather not MUX in a zero after the register file.
There's too much MUXing already. Rather, we'd like the register file
to actually contain a zero in r0. After some thought, I realized that
we could design WB to write zero to r0 while the system reset is
active and also to cancel any write to r0 the rest of the time. This
way, the register simply contains a zero, so reading the bit-bucket is
cheap. Only if we get hit by a cosmic ray do we have a problem.
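A sketch of that write-back guard; the signal names and the 32x32 register-file geometry are my assumptions:

```verilog
// Sketch: write-back port that keeps r0 equal to zero.  During reset
// we write zero into r0; afterwards any write targeting r0 is
// cancelled, so r0 simply contains zero and reads need no extra mux.
module my_wb_guard(
    input         reset,
    input         wb_en,        // write-back requested this cycle
    input  [4:0]  wb_addr,      // destination register number
    input  [31:0] wb_data,
    output        en_eff,       // guarded enable to the register file
    output [4:0]  addr_eff,
    output [31:0] data_eff);
    assign en_eff   = reset ? 1'b1  : (wb_en & (wb_addr != 5'd0));
    assign addr_eff = reset ? 5'd0  : wb_addr;
    assign data_eff = reset ? 32'd0 : wb_data;
endmodule
```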
It's especially important to minimize the number of layers of logic
added between the REG output registers and the ALU output registers.
The ALU is going to be a very heavy-weight thing and will probably be
what most limits our clock rate.
> Real code may be a bit premature, but a more complex example is (hope I
> got the algorithm right):
Not at all. Thought experiments using real examples can be very
productive in early design stages. You'll catch all sorts of
problems. This reminds me of the "heuristic analysis" encouraged so
much by people in Systems Engineering.
Here's another example to consider implementing: DMA image upload or
download. Try to see how we would actually implement it.
Here's the scenario. Let's say we decide not to go to native PCIe but
rather use a PCI66 to PCIe bridge chip (we have one already
identified). And let's say that we over-clock it to 75MHz so as to
absorb some of the PCI66 overhead and maximize the throughput through
PCIe. Let's also say that we manage to get the CPU to run at 100MHz.
That means that if we want to be absolutely sure that we will get the
maximum image transfer throughput, then 3 out of 4 instructions have
to cause I/O shunts from one FIFO straight away into another. This
immediately implies unrolled loops. But we also have to deal with
things like address computations and setting up DMA transactions and
dealing with the fact that we can't predict in advance how many words
have come in and how many free words are available in the output FIFO.
Oh, and let's also say that we do not have any way for these
instructions to stall the pipeline, so blindly trying to move data
could result in data loss.
How do we do it? :)
Hint: Don't try to be efficient all of the time. The PCI bus won't
be 100% efficient, so although we'll want to try to keep up when data
can be moved, we don't need to when data can't be moved. So if
there's a gap of even one cycle when you COULD move data but can't
because there is none to move, then feel free to consume multiple
cycles to deal with that case and then catch up later. My suggestion
would be to find out the number of available words in one FIFO, find
out the number of free words in the other FIFO, compute the minimum,
and then jump into the right place in an unrolled loop to move that
many words. Noting that most FIFOs won't be deeper than 16 entries,
you won't have to unroll much.
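The hardware half of that suggestion is just a comparator; the jump into the unrolled loop then gets derived from the result in software. A sketch, with all names illustrative and the 5-bit counts assuming 16-entry FIFOs:

```verilog
// Sketch: burst length = min(words waiting in, space free out), so
// software can blindly move exactly this many words without stalls.
module my_burst_len(
    input  [4:0] in_avail,    // fill count of the input FIFO
    input  [4:0] out_space,   // free count of the output FIFO
    output [4:0] burst_len);
    assign burst_len = (in_avail < out_space) ? in_avail : out_space;
endmodule
```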
--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)