[Open-graphics] The Central Processor

Petter Urkedal Wed, 28 Mar 2007 09:46:48 -0800

I've had a little off-list discussion with Tim, and we're decided we
will develop on the CPU design I posted earlier.  It may have to be
re-written later if there should be essential features which does not
fit in, but at least then we know what went wrong with the first design.


The current version is attached, along with a code generator and an
assembler.  (Don't mind the assembler too much for the moment, I'll have
a look at how easy it is to build the required library
http://www.eideticdew.org/culibs/ and maybe release a new version.  On
the other hand, the C API for the code generator has no dependencies.)

Now, it's important that the RTL is documented to the point where anyone
with a bit Verilog knowledge and dedication can understand it.
Therefore, it's good if you read and ask.  Also, note that the code is
not yet debugged, and it still lacks register-forwarding.

Then, about feature requests.  Well, maybe we can avoid calling them
that, and instead call them ideas for how to optimise the design.  There
is a trade-off:  If we add more features, we risk having to down-clock
the CPU due to the extra levels of logic.  On the other hand, we don't
want to be forced to use too many instructions in a critical loop due to
a too weak instruction set.  The current design is an attempt to keep a
bare minimum of logic; almost no decoding of the instruction word, few
ALU operations, and only 32 bit signed arithmetic.  From there we can
hopefully enrich it slightly after knowing the time-critical tasks (see
next paragraph), or we'll end up re-writing it.

Before we know CPU is up to the task, we'll have to look closer on what
it will do, and how we'll connect and program it.  Especially we should
identify bottle-necks where we can't afford to use too many
instructions.  Some notes I've made form former posts:

  * Translation for VGA text and graphics modes.
  * Manage data transfers for copies between host memory and graphics
    memory (image up/downloads)
  * Unpack GPU command packets into GPU register writes.
  * Possibly, some 2D commands to save PCI bandwidth, since 3D engine
    needs more setup.
  * Bug work-arounds that would require a lot of PCI overhead.
  * Some management of race conditions.

There are also some general "ABI" decisions:

  * How will we do "multi-tasking"?  One big even-loop?  Interrupts?
  * Will we use a stack for calls, or do we plan carefully how to utilize
    the numerous registers?  Maybe a combination, allocate registers for
    inner calls to small subroutines, and stack otherwise?
  * Communication with periphery.  I think it's already indicated that
    we'll have dedicated circuitry.

I'll go on with an overview of the current CPU.  It should be
accompanied with the Verilog source, and will hopefully help understand
it.  We can go into more detail in follow-up posts.


== Overview of the Processor ==

To repeat the stages
  1. fetch instruction, handle branches as instructed by Stage 2
  2. fetch registers, feed back to 1 for branching
  3. ALU
  4. arithmetic: pass though
     fetch: read memory at address given by ALU
     store: store value rZ at address given by ALU
  5. write-back to register file.  This is in the same module as Stage 2
     because it accesses the same register file.

The top of the Verilog source contains definitions which defines the
instruction format, the main part being:

    // Instruction Format
    //
    `define QMODE_BITS      31:30
    `define QOP_BITS        29:27
    `define QIMM_BIT        26
    `define IZ_BITS         25:21
    `define IX_BITS         20:16
    `define IY_BITS         15:11
    `define IMM_BITS        15:0
    `define IMM_SIGN_BIT    15

These are quite orthogonal and used by different stages.  The main mode of
operation is determined by QMODE_BITS:

    `define QMODE_ARITH   0  // a pure computation on registers
    `define QMODE_FETCH   1  // fetching from local memory
    `define QMODE_STORE   2  // storing to local memory
    `define QMODE_BRANCH  3  // branch on condition, jump to subroutine

This part of the instruction is used by most of the stages.


=== Arithmetic Mode and the ALU ===

Lets first consider QMODE_ARITH.  In this mode, we do a pure
calculation, taking two operands and writing back the result to a
register.  The first operand is always a register, indexed by IX_BITS.
The second operand is either the register indexed by IY_BITS if the
QIMM_BIT is clear, otherwise it is the 32 bit signed extension of the 16
bit value in IMM_BITS.  Note that IY_BITS and IMM_BITS overlap, which is
ok since they are never used simultaneously.  Finally, the result is
written to register indexed by IZ_BITS.

QOP_BITS determine the operation of the ALU, and is active for all modes
except branching.  The operations are

    // Instruction Format: ALU Operations
    `define QOP_AND   0
    `define QOP_OR    1
    `define QOP_XOR   2 // XOR with -1 for NOT.
    `define QOP_SHL   3 // Left shift for rY > 0, right shift for rY < 0
    `define QOP_ADD   4
    `define QOP_RSUB  5 // Reversed args for maximum flexibility.
    `define QOP_MULT  6

If we combine these with the QMODE_ARITH mode, then we get the instructions

    and rX, rcY, rZ     ; rZ := rX & rcY
    or rX, rcY, rZ
    xor rX, rcY, rZ
    lsh rX, rcY, rZ     ; rZ := if rcY < 0 then rX >> -rcY else rX << rcY
    add rX, rcY, rZ     ; rZ := rcY + rZ
    rsub rX, rcY, rZ    ; rZ := rcY - rX
    mult rX, rcY, rZ    ; rZ := rX * rcY        ; Note: always signed!

where rX and rZ are always registers, and rcY may be a register or a
sign-extended 16 bits immediate value.  Notice that in arithmetic mode,
Stage 4 passes the ALU result unchanged to Stage 5, which writes back
the result to the register indexed by the IZ_BITS.


=== Memory Fetch and Store Modes ===

Memory operations are handled by Stage 4, and therefore has access to
the ALU result of Stage 3.  We use the ALU result as the address of
memory operations.  Thus, for each arithmetic instruction, there is one
fetch (QMODE_FETCH) and one store (QMODE_STORE) instruction.  I'll use
"add" as an example, and it's corresponding fetch variant we'll denote
by

    add rX, rcY fetch rZ

This has the same instruction format as "add", except that the
QMODE_BITS are set to QMODE_FETCH.  The effect is to fetch the value at
address rX + rcY into register rZ.  For fetch, Stage 4 takes the ALU
result as address, and passes the looked up location to Stage 5
(write-back).

Similarly we'll write the store instruction as

    add rX, cY store rZ

Again the instruction format is the same as for "add", except that
QMODE_BITS have the value QMODE_STORE.  The meaning is that rZ is stored
to address rX + cY.  There are, however, some things to note about this.
First, there are 3 inputs here, but the register-fetch can only fetch 2
registers.  Therefore, the store instruction must always be immediate
(cY), so the address is only computed from one register.  Second, this
is the only instruction where rZ is *input*, and write-back is disabled.
This logic you can find in Stage 2, 5, in particular the computation of
iyz and yz_o.


=== Branching ===

Finally, branches are different from the above, since they don't use the
ALU.  The ALU is skipped here partly because it's very critical to load
the target address back into stage 1 (instruction fetch) as soon as
possible.  That also means the ALU bits in the instruction word can be
used to specify the condition of the branch.  The relevant bits of the
instruction word are

    `define BNOT_BIT  29  // negates branch condition
    `define BZERO_BIT 28  // branch if zero (can combine with BNEG_BIT)
    `define BNEG_BIT  27  // branch if negative

If we combine these, we get

    000 never branch (noop)         100 branch always (jump)
    001 branch if negative          101 branch if non-negative
    010 branch if zero              110 branch if non-zero
    011 branch if non-positive      111 branch if positive

or the instructions

    bfalse rX, rcY, rZ  ; basically noop, but saves next PC to rZ.
    bneg rX, rcY, rZ
    bzero rX, rcY, rZ
    bnpos rX, rcY, rZ
    btrue rX, rcY, rZ   ; works as both jump and jsr.
    bnneg rX, rcY, rZ
    bnzero rX, rcY, rZ
    bpos rX, rcY, rZ

The write-back register rZ here gets the next program counter.  For
internal calls, a bit-bucket register can be passed (I suggest r31).
For subroutine calls, rZ will be the register where we pass the
continuation PC (return address).

Since it takes one cycle for a branch instruction to pass though Stage
2, before the result needed for deciding the condition is fed back to
Stage 1, there is a 1 cycle delay on branches, so a loop may look like

        btrue loop_test
loop_body:
        ;; ...
loop_test:
        bnzero r0, loop, r31
          add r0, -1, r0    ; always executed before the jump to loop_body

Real code may be a bit premature, but a more complex example is (hope I
got the algorithm right):

        ;; CALL:    pow r0:x r1:y r2:cont r5-r30:cont_data
        ;; EFFECTS: cont r0:result r5-r30:cont_data
pow:
        and r3, 1, r3
        btrue r31, pow__enter, r31
          or r3, 1, r3
pow__loop:
        bzero r4, pow__next, r31
          mult r0, r0, r0       ;; r0 = x↑(2↑i)
        mult r0, r3, r3
pow__next:
        shl r1, -1, r1
pow__enter:
        bnzero r1, pow__loop, r31
          and r1, 1, r4
        btrue r31, r2, r31      ;; call cont with result in r0
          add r3, 0, r0

The generated ROM image for it is (32 bit hex):

    04630001
    a7ff0007
    0c630001
    97e40006
    30000000
    30601800
    1c21ffff
    b7e10003
    04810001
    a3ff1000
    24030000

/*
Copyright 2007, Traversal Technology
Copyright 2007, Petter Urkedal

DUAL LICENSING

(0) This "Work" is defined to be this document or source code, parts of
this document or source code, or derivative works of this document or 
source code.  Use of the Work, in whole or in part, must comply with 
the licensing terms below.

(1) This Work is licensed under the GNU General Public License (GPL) as 
published by the Free Software Foundation; either version 2 of the 
License, or (at your option) any later version.  You have the right to 
use and modify this Work.  If you distribute this work in "binary" form 
(see (7)), you must publish your modified form of this Work in accordance 
with the GPL license.

(2) This Work is also licensed as a proprietary work, all rights
belonging to Traversal Technology.   Traversal Technology may use this
Work under those terms and has the right to publish, license, and sell
this Work and derivative works as they see fit.  To remove these rights, 
you must remove this clause.

(3) Use of this Work without clause (2) forfeits the right to use any 
trademarks owned by Traversal Technology, the Open Graphics Project, or 
related organizations.  This is the only circumstance where modification
of this license by a third party is permitted.

(4) Patches, modifications, changes, and extensions (collectively, 
"Modifications") to this Work that are submitted to the Open Graphics 
Project, the Open Graphics Mailing List, directly to Traversal Technology, 
or to an agent thereof must be SIGNED by the author of said Modification 
(including the words "signed off by" and the author's name), granting 
Traversal Technology copyright privileges under clause (2), as well as 
clause (1).  Unsigned Modifications will be ignored.  Your inclusion of 
the signature implies that you have read and agreed with this license 
and have verified that you are in compliance with applicable law.

(5) Modifications committed directly to an officially recognized source 
code repository are signed implicitly.  Those who have write access to 
such a repository and who commit Modifications to that repository grant 
rights to Traversal Technology under clause (2), as well as clause (1), 
by virtue of having write access and choosing to submit Modifications.

(6) It is the responsibility of the submitter of a Modification to ensure
that they have the right to submit the Modification and that they have
all the necessary permissions (including without limitation, patents
and copyrights) from any other contributors or third parties.

(7) An implementation of this Work that is considered analogous to a 
"binary distribution" is defined as any form that is not easily 
readable by humans ("non-preferred"), which includes, but is not 
limited to:  Fixed-function IC (e.g. ASIC), fixed-function IC masks 
*/


// Instruction Format
//
`define QMODE_BITS      31:30
`define QOP_BITS        29:27
`define QIMM_BIT        26
`define IZ_BITS         25:21
`define IX_BITS         20:16
`define IY_BITS         15:11
`define IMM_BITS        15:0
`define IMM_SIGN_BIT    15

// Instruction Format: Major Operating Mode
//
`define QMODE_ARITH     0
`define QMODE_FETCH     1
`define QMODE_STORE     2
`define QMODE_BRANCH    3
`define QMODE_STORE_OR_BRANCH_BIT 31
`define QMODE_FETCH_OR_BRANCH_BIT 30

// Instruction Format: ALU Operations
`define QOP_AND     0
`define QOP_OR      1
`define QOP_XOR     2   // XOR with -1 for NOT.
`define QOP_SHL     3
`define QOP_ADD     4
`define QOP_RSUB    5   // Reversed args for maximum flexibility.
`define QOP_MULT    6

// Instruction Format: Branch Condition
//
// These overlaps with QOP_BITS which is OK since ALU is not used when
// branching.
//
`define BNOT_BIT        29    // negates branch condition
`define BZERO_BIT       28    // branch if zero (can combine with BNEG_BIT)
`define BNEG_BIT        27    // branch if negative
//
// The branch conditions can be expanded to:
//
//   3'b000 noop                        3'b100 branch always (jump)
//   3'b001 branch if negative          3'b101 branch if non-negative
//   3'b010 branch if zero              3'b110 branch if non-zero
//   3'b011 branch if non-positive      3'b111 branch if positive

`define IREG_BITBUCKET  15
`define START_ADDR      0
`define SAFE_INSN       32'h00000000


// The Top-Level CPU Module
// ========================
//
module oga1cp(clock, reset_, clock_2x, phase,
              do_upload, upload_addr, upload_data);
//> Standard signals.
input clock;
input reset_;
//> Clock running at twice the frequency of a[clock], with rising edge aligned
//  with rising edge of a[clock].
input clock_2x;
//> A register which must be asserted on rising a[clock] and deasserted on
//  falling a[clock].  That is, it's signal shape is mostly identical to
//  a[clock], except for the small detail of being registered.
input phase;

//> On each rising clock edge where a[do_upload] is high, the value
//  a[upload_data] is written to a[upload_addr].  The CPU will start running
//  from address 0 on the first clock cycle after a[do_upload] goes low.
input do_upload;
input[8:0] upload_addr;
input[31:0] upload_data;

// -- end interface --

wire[31:0] s1_insn, s2_insn, s3_insn;
wire[8:0] s1_pc;
wire[31:0] s2_x;
wire[31:0] s2_y, s2_store_val;
wire[31:0] s3_store_val;
wire[4:0] s4_wb_reg;
wire[31:0] s3_z, s4_wb_val;
wire s4_wb_enable;

oga1cp_stg1_fetch stg1(
    clock, reset_,
    do_upload, upload_addr, upload_data,
    s1_insn, s1_pc,
    s2_insn[`BNOT_BIT], s2_insn[`BZERO_BIT], s2_insn[`BNEG_BIT], s2_x, s2_y);

oga1cp_stg2_regio stg2(
    .clock(clock), .reset_(reset_), .clock_2x(clock_2x), .phase(phase),
    .insn(s1_insn), .pc(s1_pc),
    .insn_o(s2_insn), .x_o(s2_x), .y_o(s2_y), .yz_o(s2_store_val),
    .wb_enable(s4_wb_enable), .wb_reg(s4_wb_reg), .wb_val(s4_wb_val));

oga1cp_stg3_alu stg3(
    clock, reset_,
    s2_insn, s2_x, s2_y, s2_store_val,
    s3_insn, s3_z, s3_store_val);

oga1cp_stg4_memio stg4(
    clock, reset_,
    s3_insn, s3_z, s3_store_val,
    s4_wb_enable, s4_wb_reg, s4_wb_val);

endmodule


// Stage 1: Instruction Fetch
// ==========================
//
module oga1cp_stg1_fetch(clock, reset_, do_upload, upload_addr, upload_data,
                         insn_o, pc_o,
                         s2_bnot, s2_bzero, s2_bneg, s2_x, s2_y);
//> Standard signals.
input clock;
input reset_;

//> On rising edge where a[do_upload] is non-zero, a[upload_data] is
//  stoned at a[upload_addr] in program memory.
input do_upload;
input[8:0] upload_addr;
input[31:0] upload_data;

//> The current instruction.
output[31:0] insn_o;
//> The address just after the current instruction.
output[8:0] pc_o;

//> On rising edge when the instruction at stage 2 is a branch, a
//  conditional branch is done.  If a[s2_bnot] is 0, the condition
//  is that a[s2_bzero] is 1 and a[s2_x] is zero or a[s2_bneg] is 1 and
//  a[s2_x] is negative.  If a[s2_bnot] is 1, the condition is negated.
input s2_bnot;
input s2_bzero;
input s2_bneg;

//> Test register for branch.
input[31:0] s2_x;

//> The target address for branch.
input[31:0] s2_y;

// -- end interface --

reg s2_insn_is_branch;
wire is_zero = s2_x == 0;
wire is_neg = s2_x[31];
wire do_branch = s2_insn_is_branch
        && s2_bnot !== (is_zero == s2_bzero && is_neg == s2_bneg);

reg [8:0] pc_o;
wire [8:0] next_pc = do_branch ? s2_y : pc_o;

always @(posedge clock or negedge reset_)
    if (!reset_ || do_upload) begin
        pc_o <= `START_ADDR - 1;
        s2_insn_is_branch <= 0;
    end else begin
        pc_o <= next_pc + 1;
        s2_insn_is_branch <= insn_o[`QMODE_BITS] == `QMODE_BRANCH;
    end

RAMB16_S36_S36 program_memory (
    .CLKA(clock), .SSRA(1'b0),
    .ADDRA(next_pc),
    .ENA(1'b1),         .DOA(insn_o),           .DOPA(),
    .WEA(1'b0),         .DIA(32'b0),            .DIPA(4'b0),

    .CLKB(clock), .SSRB(1'b0),
    .ADDRB(upload_addr),
    .ENB(1'b1),         .DOB(),                 .DOPB(),
    .WEB(do_upload),    .DIB(upload_data),      .DIPB(4'b0));

endmodule


// Stages 2, 5: Register Fetch and Write-Back
// ==========================================
//
module oga1cp_stg2_regio(clock, reset_, clock_2x, phase,
                         insn, pc,
                         insn_o, x_o, y_o, yz_o,
                         wb_enable, wb_reg, wb_val);
//> Standard signals.
input clock;
input reset_;
//> See the corresponding signals in ref[oga1cp].
input clock_2x;
input phase;

//> The instruction from stage 1.
input[31:0] insn;
//> The program counter just after a[insn].
input[8:0] pc;

//> The a[insn] delayed by 1 clock cycle.
output[31:0] insn_o;
//> The register value of the first operand according to a[insn_o].
output[31:0] x_o;
//> The second operand according to a[insn_o].  If the c[`QIMM_BIT] is set,
//  this is the immedite value c[insn[`IMM_BITS]], otherwise it's a register.
output[31:0] y_o;
//> Register y if y is not immediate, otherwise register z.  This is passed
//  unchanged to stage 4, where it is used as the output value for store
//  operations.  Since store operations are always immediate, this is in
//  pracise always register z.
output[31:0] yz_o;

//> On a falling clock cycle when a[wb_enable] is asserted, a[wb_val] is
//  stored to register a[wb_val].
input wb_enable;
input[4:0] wb_reg;
input[31:0] wb_val;

// -- end interface --

reg[31:0] x_o, yz_o;
reg[31:0] insn_o;
wire[31:0] y_out_if_imm;

wire[4:0] ix = insn[`IX_BITS];
wire[4:0] iyz = insn[`QIMM_BIT]? insn[`IZ_BITS] : insn[`IY_BITS];

always @(posedge clock or negedge reset_)
    if (!reset_) insn_o <= `SAFE_INSN;
    else         insn_o <= insn;

reg[31:0] regfile[0:31];
always @(posedge clock_2x)
    if (phase == 1) begin // falling edge of clock (right?)
        if (wb_enable)
            regfile[wb_reg] <= wb_val;
    end else begin
        if (insn[`QMODE_BITS] == `QMODE_BRANCH)
            x_o <= pc + 1;
        else
            x_o <= regfile[ix];
        yz_o <= regfile[iyz];
    end

assign y_out_if_imm = {{16{insn_o[`IMM_SIGN_BIT]}}, insn_o[`IMM_BITS]};
assign y_o = insn_o[`QIMM_BIT]? y_out_if_imm : yz_o;

endmodule


// Stage 3: ALU
// ============
//
module oga1cp_stg3_alu(clock, reset_, insn, x, y, yz, insn_o, res_o, yz_o);
//> Standard signals.
input clock;
input reset_;

//> The instruction from stage 2.
input[31:0] insn;
//> The operands.
input[31:0] x;
input[31:0] y;
//> Value to be passed though.  It is used by store.
input[31:0] yz;

//> The one-cycle delayd a[insn].
output[31:0] insn_o;
//> The computed value, except:  For branching, the value a[x] is passed
//  though unchanged.  This is the program counter.
output[31:0] res_o;
//> Passed though a[yz], delayed one clock cycle.
output[31:0] yz_o;

// -- end interface --

reg[31:0] res_o, yz_o;
reg[31:0] insn_o;

always @(posedge clock or negedge reset_)
    if (!reset_)
        insn_o <= `SAFE_INSN;
    else begin
        yz_o <= yz;
        insn_o <= insn;
        if (insn[`QMODE_BITS] == `QMODE_BRANCH) begin
            res_o <= x;
        end else case (insn[`QOP_BITS])
            `QOP_AND:  res_o <= x & y;
            `QOP_OR:   res_o <= x | y;
            `QOP_XOR:  res_o <= x ^ y;
            `QOP_SHL:  res_o <= x << y;// FIXME. Can we have signed y?
            `QOP_ADD:  res_o <= y + x; // FIXME. Explicit instatiate ADDSUB
            `QOP_RSUB: res_o <= y - x; // ...
            `QOP_MULT: res_o <= x * y; // FIXME. Explicit instatiate.
        endcase
    end

endmodule


// Stage 4: The Memory Access Stage
// ================================
//
module oga1cp_stg4_memio(clock, reset_, insn, val_or_addr, store_val,
                         wb_enable_o, wb_reg_o, wb_val_o);
//> Standard signals.
input clock;
input reset_;

//> Instruction from stage 3.
input[31:0] insn;
//> The result of stage 3, to be used as address for fetch/store or else
//  passed though to the write-back.
input[31:0] val_or_addr;
//> For store, the value written to a[val_or_addr].  Not used for other modes.
input[31:0] store_val;

//> Control of register write-back.
output wb_enable_o;
output[4:0] wb_reg_o;
output[31:0] wb_val_o;

// -- end interface --

wire is_store = insn[`QMODE_BITS] == `QMODE_STORE;
reg is_fetch;

reg wb_enable_o;
reg[31:0] wb_val_if_not_fetch;
wire[31:0] wb_val_if_fetch;
reg[4:0] wb_reg_o;

always @(posedge clock) begin
    is_fetch <= insn[`QMODE_BITS] == `QMODE_FETCH;
    wb_val_if_not_fetch <= val_or_addr;
    wb_enable_o <= !is_store;
    wb_reg_o <= insn[`IZ_BITS];
end

assign wb_val_o = is_fetch? wb_val_if_fetch : wb_val_if_not_fetch;

RAMB16_S36_S36 local_memory(
    .CLKA(clock), .SSRA(1'b0),
    .ADDRA(val_or_addr[8:0]),
    .ENA(1'b1),         .DOA(wb_val_if_fetch),  .DOPA(),
    .WEA(is_store),     .DIA(store_val),        .DIPA(4'b0),

    .CLKB(1'b0), .SSRB(1'b0),
    .ADDRB(9'b0),
    .ENB(1'b0),         .DOB(),                 .DOPB(),
    .WEB(1'b0),         .DIB(32'b0),            .DIPB(4'b0));

endmodule

oga1cp-asm.tar.bz2
Description: BZip2 compressed data

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)

[Open-graphics] The Central Processor

Reply via email to