Timothy Normand Miller wrote:
It won't be long before we'll have to design a nanocontroller for OGD1
to manage VGA and DMA.  I may be able to just go off and design one
myself, but I think that many of you would fancy observing and
participating in the design process, and with more brains on it, we'd
do a better job.

I will try to help with this, although I have to report that while I understand the basics of Verilog, I haven't had the same success with a self-taught crash course as I did with PostScript. :-)

You don't need chip design knowledge for this.  You just have to
understand logic, have some familiarity with assembly programming, and
have a sense for the parallelism that goes on in a chip.  It's as
though you wrote a C program where every function in your program runs
simultaneously with every other function in numerous threads.

How about we get started with a high-level overview?  A pipelined RISC
processor is broken up into stages.  A stage does its work and passes
its results on to the next stage while at the same time accepting new
work from the preceding stage.  In steady state, as long as a stream
of work is available, then all stages in the pipeline are doing useful
work, with earlier stages working on earlier instructions.
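To make the overlap concrete, here's a toy sketch in Python (purely an illustration of the timing, not hardware):

```python
# Five pipeline stages, each holding one instruction per cycle.
# Every cycle, work shifts one stage to the right while a new
# instruction enters at the left.

STAGES = ["fetch", "decode", "alu", "mem", "writeback"]

def simulate(instructions, cycles):
    pipeline = [None] * len(STAGES)   # what each stage holds this cycle
    trace = []
    stream = iter(instructions)
    for _ in range(cycles):
        # shift: each stage passes its work to the next stage
        pipeline = [next(stream, None)] + pipeline[:-1]
        trace.append(list(pipeline))
    return trace

trace = simulate(["i1", "i2", "i3", "i4", "i5"], 6)
# Once the pipeline fills, every stage is busy, with earlier stages
# holding later instructions:
#   trace[4] == ['i5', 'i4', 'i3', 'i2', 'i1']
```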

If you want a good textbook on this, look for "Computer Architecture:
A Quantitative Approach" by Hennessy and Patterson.

Here's how we'll break up our processor pipeline, deviating slightly
from the MIPS template described in that book.

(1) Instruction fetch
Here, you have an instruction pointer that indicates the address of
the next instruction to execute.  In our processor, our instructions
are stored in a local static RAM inside of the FPGA, so there is no
need for any sort of "cache miss" logic.  With an address, you are
guaranteed to get an instruction immediately on the next cycle.  Our
instructions are 32 bits wide.  (We could go to 36 bits if we find it
helpful.)
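A toy model (Python) of that guarantee: a block RAM with a synchronous read port registers the address, so presenting an address this cycle yields the instruction word on the next.

```python
# Instruction fetch against an on-FPGA block RAM: no cache-miss logic,
# the word for last cycle's address is always available this cycle.

class InstructionRAM:
    def __init__(self, words):
        self.mem = list(words)    # 32-bit instruction words
        self.addr_reg = 0         # registered address (synchronous read)

    def clock(self, next_pc):
        data = self.mem[self.addr_reg]  # word for last cycle's address
        self.addr_reg = next_pc         # latch the new address
        return data

ram = InstructionRAM([0x11111111, 0x22222222, 0x33333333])
ram.clock(1)          # present address 1; returns the word at address 0
word = ram.clock(2)   # returns the word at address 1
```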

That might be useful if we need to have multiple types of instructions.

(2) Instruction decode and register access
One of the main principles behind RISC processors is making
instruction decode absolutely trivial.  The instruction is broken up
into fixed fields holding register numbers, and all instructions are
structured this way.  What that means for us is that we can take the
source operand fields straight out of the instruction and use them as
indexes into our register file, with no logic in between.  We'll have
32 registers, so we need 5-bit fields.  We need fields in the
instruction for two source operands and one destination operand (that
will get used in a later pipeline stage).
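In other words, decode is just fixed bit slicing.  A sketch in Python (the bit positions below are my own assumptions for illustration, not a settled encoding):

```python
# Decode is nothing but fixed-position field extraction; the 5-bit
# source fields can drive the register file's read ports directly.

def decode(instr):
    opcode = (instr >> 26) & 0x3F   # assumed 6-bit opcode
    rs1    = (instr >> 21) & 0x1F   # 5-bit source register 1
    rs2    = (instr >> 16) & 0x1F   # 5-bit source register 2
    rd     = (instr >> 11) & 0x1F   # 5-bit destination register
    return opcode, rs1, rs2, rd

regfile = [0] * 32
instr = (0x01 << 26) | (3 << 21) | (7 << 16) | (12 << 11)
opcode, rs1, rs2, rd = decode(instr)
a, b = regfile[rs1], regfile[rs2]   # straight indexed reads, no logic
```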

Isn't it more usual to have a load-store architecture with two-address operations?  OTOH, there are supposed advantages to register-set processors (no accumulators).

This is also where we need to deal with branches.  If the instruction
is a branch, the condition needs to be resolved, and the address needs
to be fed back to stage (1).  This is why RISC processors typically
have a delayed branch.  The possible branch conditions are reg-value=0
and reg-value!=0.
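With only those two conditions, resolving a branch is a single compare against zero.  A sketch (Python; the mnemonics are illustrative):

```python
# Branch resolution in the decode stage: the only conditions are
# reg == 0 and reg != 0, so one zero-test picks the next fetch address.

def resolve_branch(kind, reg_value, target, fallthrough_pc):
    # kind: "beqz" (branch if zero) or "bnez" (branch if non-zero)
    taken = (reg_value == 0) if kind == "beqz" else (reg_value != 0)
    return target if taken else fallthrough_pc

next_pc = resolve_branch("beqz", 0, 0x40, 0x10)   # taken
```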

Were you planning to have a skip-on-condition instruction?

(3) ALU
Here, the numbers fetched from registers in stage (2) are combined
based on an opcode in the instruction.  ALU operations include add,
subtract, shift, multiply (using dedicated multiplier logic), and
bitwise logical operations.
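As a functional sketch (Python), with 32-bit wraparound modeled by a mask; the operation set is the one listed above, and in hardware the multiply would use the FPGA's dedicated multiplier blocks:

```python
MASK = 0xFFFFFFFF   # model 32-bit registers

def alu(op, a, b):
    if op == "add":  return (a + b) & MASK
    if op == "sub":  return (a - b) & MASK
    if op == "mul":  return (a * b) & MASK   # dedicated multiplier in hardware
    if op == "sll":  return (a << (b & 31)) & MASK   # shift left
    if op == "srl":  return (a >> (b & 31)) & MASK   # shift right
    if op == "and":  return a & b
    if op == "or":   return a | b
    if op == "xor":  return a ^ b
    raise ValueError(op)
```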

We may implement what's called result forwarding.  Since the flow of
data through the processor is completely deterministic, then we can
figure out which pipeline stage has an ALU result before the result
has made it to the register file in stage 5.  This way, you can use as
a source operand in one instruction what was the target of the
immediately preceding instruction.
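The forwarding decision itself is just a register-number comparison.  A toy version in Python:

```python
# Result forwarding: if the previous instruction's destination matches
# our source register, take the value from the ALU stage instead of
# the (stale) register file.

def read_operand(src_reg, regfile, prev_dest, prev_result):
    # prev_dest: destination register of the instruction one stage ahead
    if prev_dest is not None and src_reg == prev_dest:
        return prev_result          # forward from the ALU output
    return regfile[src_reg]         # normal register-file read

regfile = [0] * 32
# Previous instruction wrote 99 to r5, but the value hasn't reached
# the register file yet:
fwd = read_operand(5, regfile, prev_dest=5, prev_result=99)   # forwarded
raw = read_operand(6, regfile, prev_dest=5, prev_result=99)   # from regfile
```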

The MIPS processor stays simple by not having any result flags.  That
is, in an x86 processor, math instructions yield carry, zero, negative,
and overflow flags (among others).  MIPS doesn't do that, because it
causes all sorts of challenging dependencies.  You're better off using
a few extra instructions and having a processor that's simpler and
faster for everything else.

Yes, saving that state beyond the next instruction causes real problems.  An exception is that if the next instruction is a BRANCH, then you keep the flags for another cycle.  That allows for multi-way branches.

Comparisons are done in the ALU.  The subtract instruction is used for
equal/not-equal comparisons.  In addition, we'll provide signed and
unsigned less-than instructions.  With these three instructions, you
can get any of the usual comparisons that you want to make.  The
result of the comparison is dumped into a register, just like the
result of any math operation, and used by the conditional branch
instruction that compares it to zero.  That means we "waste" a whole
32-bit register for what is really only a single-bit result.  But that
approach saves us logic in the long-run.
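A sketch (Python) of how those three instructions cover the usual comparisons; the mnemonics here are illustrative:

```python
MASK = 0xFFFFFFFF

def sub(a, b):                     # equal iff the result is zero
    return (a - b) & MASK

def sltu(a, b):                    # unsigned less-than, result 0 or 1
    return 1 if a < b else 0

def slt(a, b):                     # signed less-than, result 0 or 1
    sa = a - (1 << 32) if a & 0x80000000 else a
    sb = b - (1 << 32) if b & 0x80000000 else b
    return 1 if sa < sb else 0

# a >= b (unsigned) is "branch if sltu(a, b) == 0"; a <= b is
# sltu(b, a) == 0; and so on -- the comparison result lands in a
# register and the branch just tests it against zero.
```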

You could use more than one bit and then have a branch instruction use these.

        COMPU   R11, R12
        BOC     <condition>, <location>

COMPU is compare unsigned.  You also need COMPS and COMPF.

With BOC, the condition code (4 bits) is compared to the bits in the condition register; the branch is taken if any of them match.  BONC is the complement: it doesn't branch if the condition is met.

(4) Memory access and I/O
This is the stage where we take an address computed above and read or
write our local memory.  Our "local" memory is actually another
512-word block RAM that we'll use as scratch space.

I believe the MIPS processor uses the ALU to add the contents of one
register to a short immediate value stored in the instruction, and
that's used as the address.  We should do the same.  That makes it so
that the only memory addressing mode is reg-value + offset.
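A sketch (Python) of that single addressing mode; the 16-bit immediate width and the 512-word wrap are my assumptions for illustration:

```python
# Effective address = register value + sign-extended short immediate.
# This is the only address the machine ever forms for local memory.

def effective_address(base_reg_value, imm16):
    # sign-extend a 16-bit immediate (width is illustrative)
    offset = imm16 - (1 << 16) if imm16 & 0x8000 else imm16
    return (base_reg_value + offset) & 0x1FF   # 512-word local memory

ea_pos = effective_address(0x100, 0x0004)   # base + 4
ea_neg = effective_address(0x100, 0xFFFC)   # base - 4
```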

I don't see how this helps.  You can't presume that the register with the base address is available until the start of the execution stage of the pipeline, so the memory fetch is going to take two cycles: one to do the add and a second to do the fetch.  If we eliminate the offset, we would gain speed (one clock) when an offset wasn't required, break even part of the time (sequential reads on a base address), and lose when we were doing random access based on a base register.

In addition, this is also the stage where we'll want to do other
I/O-related operations, such as providing access to real graphics
memory and controlling other aspects of the GPU that are accessible by
this processor.  We'll make that available as another 512-word space
(or more or less as necessary) of read-only and write-only "memory
locations".

We'll treat graphics memory access as though we're controlling some
other device.  Writes involve dropping a pair of words (address, data)
into a queue.  Reads involve dropping a word (address) into a queue
and then some time later, popping the read data out of another queue.
Those queues will show up as "memory addresses" to the CPU.  In fact,
the CPU will control quite a number of things by writing/reading
queues.
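To make the queue idea concrete, a sketch in Python (the queue "addresses" below are made up for illustration):

```python
from collections import deque

# Hypothetical CPU-visible "memory locations" backed by queues:
GFX_WRITE_Q   = 0x200   # push address word, then data word
GFX_READ_REQ  = 0x201   # push an address to request a read
GFX_READ_DATA = 0x202   # pop read results some time later

class QueueIO:
    def __init__(self):
        self.write_q = deque()      # alternating address/data words
        self.read_req_q = deque()   # outstanding read addresses
        self.read_data_q = deque()  # data coming back from gfx memory

    def cpu_store(self, addr, value):   # CPU write to an I/O "address"
        if addr == GFX_WRITE_Q:
            self.write_q.append(value)
        elif addr == GFX_READ_REQ:
            self.read_req_q.append(value)

    def cpu_load(self, addr):           # CPU read from an I/O "address"
        if addr == GFX_READ_DATA:
            return self.read_data_q.popleft()

io = QueueIO()
io.cpu_store(GFX_READ_REQ, 0x1234)   # CPU requests a read
addr = io.read_req_q.popleft()       # memory controller services it
io.read_data_q.append(0xAB)          # ...and returns the data later
data = io.cpu_load(GFX_READ_DATA)    # CPU pops the result
```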

(5) Register write-back
The register file read in stage 2 is actually double-pumped.  It runs
at double the clock rate of the rest of the processor.  On the first
half clock cycle, we perform writes.  On the second half, we perform
reads.  In fact, you might say that stages (2) and (5) are really
parts of the same stage.
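A toy model (Python) of the double-pumped behavior: within one full processor cycle, the stage-(5) write happens on the first half and the stage-(2) reads on the second half, so a freshly written value is visible to a read in the same cycle.

```python
# Double-pumped register file: write-then-read within one full cycle.

class RegFile:
    def __init__(self):
        self.regs = [0] * 32

    def full_cycle(self, write_reg, write_val, read_a, read_b):
        if write_reg is not None:       # first half-cycle: write-back
            self.regs[write_reg] = write_val
        # second half-cycle: both reads see the new value
        return self.regs[read_a], self.regs[read_b]

rf = RegFile()
vals = rf.full_cycle(write_reg=7, write_val=42, read_a=7, read_b=0)
```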

You mean (for example) that write occurs on the rising clock and read occurs on the falling clock?

--
JRT
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
