I've had a little off-list discussion with Tim, and we're decided we will develop on the CPU design I posted earlier. It may have to be re-written later if there should be essential features which does not fit in, but at least then we know what went wrong with the first design.
The current version is attached, along with a code generator and an assembler. (Don't mind the assembler too much for the moment, I'll have a look at how easy it is to build the required library http://www.eideticdew.org/culibs/ and maybe release a new version. On the other hand, the C API for the code generator has no dependencies.) Now, it's important that the RTL is documented to the point where anyone with a bit Verilog knowledge and dedication can understand it. Therefore, it's good if you read and ask. Also, note that the code is not yet debugged, and it still lacks register-forwarding. Then, about feature requests. Well, maybe we can avoid calling them that, and instead call them ideas for how to optimise the design. There is a trade-off: If we add more features, we risk having to down-clock the CPU due to the extra levels of logic. On the other hand, we don't want to be forced to use too many instructions in a critical loop due to a too weak instruction set. The current design is an attempt to keep a bare minimum of logic; almost no decoding of the instruction word, few ALU operations, and only 32 bit signed arithmetic. From there we can hopefully enrich it slightly after knowing the time-critical tasks (see next paragraph), or we'll end up re-writing it. Before we know CPU is up to the task, we'll have to look closer on what it will do, and how we'll connect and program it. Especially we should identify bottle-necks where we can't afford to use too many instructions. Some notes I've made form former posts: * Translation for VGA text and graphics modes. * Manage data transfers for copies between host memory and graphics memory (image up/downloads) * Unpack GPU command packets into GPU register writes. * Possibly, some 2D commands to save PCI bandwidth, since 3D engine needs more setup. * Bug work-arounds that would require a lot of PCI overhead. * Some management of race conditions. There are also some general "ABI" decisions: * How will we do "multi-tasking"? One big even-loop? Interrupts? * Will we use a stack for calls, or do we plan carefully how to utilize the numerous registers? Maybe a combination, allocate registers for inner calls to small subroutines, and stack otherwise? * Communication with periphery. I think it's already indicated that we'll have dedicated circuitry. I'll go on with an overview of the current CPU. It should be accompanied with the Verilog source, and will hopefully help understand it. We can go into more detail in follow-up posts. == Overview of the Processor == To repeat the stages 1. fetch instruction, handle branches as instructed by Stage 2 2. fetch registers, feed back to 1 for branching 3. ALU 4. arithmetic: pass though fetch: read memory at address given by ALU store: store value rZ at address given by ALU 5. write-back to register file. This is in the same module as Stage 2 because it accesses the same register file. The top of the Verilog source contains definitions which defines the instruction format, the main part being: // Instruction Format // `define QMODE_BITS 31:30 `define QOP_BITS 29:27 `define QIMM_BIT 26 `define IZ_BITS 25:21 `define IX_BITS 20:16 `define IY_BITS 15:11 `define IMM_BITS 15:0 `define IMM_SIGN_BIT 15 These are quite orthogonal and used by different stages. The main mode of operation is determined by QMODE_BITS: `define QMODE_ARITH 0 // a pure computation on registers `define QMODE_FETCH 1 // fetching from local memory `define QMODE_STORE 2 // storing to local memory `define QMODE_BRANCH 3 // branch on condition, jump to subroutine This part of the instruction is used by most of the stages. === Arithmetic Mode and the ALU === Lets first consider QMODE_ARITH. In this mode, we do a pure calculation, taking two operands and writing back the result to a register. The first operand is always a register, indexed by IX_BITS. The second operand is either the register indexed by IY_BITS if the QIMM_BIT is clear, otherwise it is the 32 bit signed extension of the 16 bit value in IMM_BITS. Note that IY_BITS and IMM_BITS overlap, which is ok since they are never used simultaneously. Finally, the result is written to register indexed by IZ_BITS. QOP_BITS determine the operation of the ALU, and is active for all modes except branching. The operations are // Instruction Format: ALU Operations `define QOP_AND 0 `define QOP_OR 1 `define QOP_XOR 2 // XOR with -1 for NOT. `define QOP_SHL 3 // Left shift for rY > 0, right shift for rY < 0 `define QOP_ADD 4 `define QOP_RSUB 5 // Reversed args for maximum flexibility. `define QOP_MULT 6 If we combine these with the QMODE_ARITH mode, then we get the instructions and rX, rcY, rZ ; rZ := rX & rcY or rX, rcY, rZ xor rX, rcY, rZ lsh rX, rcY, rZ ; rZ := if rcY < 0 then rX >> -rcY else rX << rcY add rX, rcY, rZ ; rZ := rcY + rZ rsub rX, rcY, rZ ; rZ := rcY - rX mult rX, rcY, rZ ; rZ := rX * rcY ; Note: always signed! where rX and rZ are always registers, and rcY may be a register or a sign-extended 16 bits immediate value. Notice that in arithmetic mode, Stage 4 passes the ALU result unchanged to Stage 5, which writes back the result to the register indexed by the IZ_BITS. === Memory Fetch and Store Modes === Memory operations are handled by Stage 4, and therefore has access to the ALU result of Stage 3. We use the ALU result as the address of memory operations. Thus, for each arithmetic instruction, there is one fetch (QMODE_FETCH) and one store (QMODE_STORE) instruction. I'll use "add" as an example, and it's corresponding fetch variant we'll denote by add rX, rcY fetch rZ This has the same instruction format as "add", except that the QMODE_BITS are set to QMODE_FETCH. The effect is to fetch the value at address rX + rcY into register rZ. For fetch, Stage 4 takes the ALU result as address, and passes the looked up location to Stage 5 (write-back). Similarly we'll write the store instruction as add rX, cY store rZ Again the instruction format is the same as for "add", except that QMODE_BITS have the value QMODE_STORE. The meaning is that rZ is stored to address rX + cY. There are, however, some things to note about this. First, there are 3 inputs here, but the register-fetch can only fetch 2 registers. Therefore, the store instruction must always be immediate (cY), so the address is only computed from one register. Second, this is the only instruction where rZ is *input*, and write-back is disabled. This logic you can find in Stage 2, 5, in particular the computation of iyz and yz_o. === Branching === Finally, branches are different from the above, since they don't use the ALU. The ALU is skipped here partly because it's very critical to load the target address back into stage 1 (instruction fetch) as soon as possible. That also means the ALU bits in the instruction word can be used to specify the condition of the branch. The relevant bits of the instruction word are `define BNOT_BIT 29 // negates branch condition `define BZERO_BIT 28 // branch if zero (can combine with BNEG_BIT) `define BNEG_BIT 27 // branch if negative If we combine these, we get 000 never branch (noop) 100 branch always (jump) 001 branch if negative 101 branch if non-negative 010 branch if zero 110 branch if non-zero 011 branch if non-positive 111 branch if positive or the instructions bfalse rX, rcY, rZ ; basically noop, but saves next PC to rZ. bneg rX, rcY, rZ bzero rX, rcY, rZ bnpos rX, rcY, rZ btrue rX, rcY, rZ ; works as both jump and jsr. bnneg rX, rcY, rZ bnzero rX, rcY, rZ bpos rX, rcY, rZ The write-back register rZ here gets the next program counter. For internal calls, a bit-bucket register can be passed (I suggest r31). For subroutine calls, rZ will be the register where we pass the continuation PC (return address). Since it takes one cycle for a branch instruction to pass though Stage 2, before the result needed for deciding the condition is fed back to Stage 1, there is a 1 cycle delay on branches, so a loop may look like btrue loop_test loop_body: ;; ... loop_test: bnzero r0, loop, r31 add r0, -1, r0 ; always executed before the jump to loop_body Real code may be a bit premature, but a more complex example is (hope I got the algorithm right): ;; CALL: pow r0:x r1:y r2:cont r5-r30:cont_data ;; EFFECTS: cont r0:result r5-r30:cont_data pow: and r3, 1, r3 btrue r31, pow__enter, r31 or r3, 1, r3 pow__loop: bzero r4, pow__next, r31 mult r0, r0, r0 ;; r0 = x↑(2↑i) mult r0, r3, r3 pow__next: shl r1, -1, r1 pow__enter: bnzero r1, pow__loop, r31 and r1, 1, r4 btrue r31, r2, r31 ;; call cont with result in r0 add r3, 0, r0 The generated ROM image for it is (32 bit hex): 04630001 a7ff0007 0c630001 97e40006 30000000 30601800 1c21ffff b7e10003 04810001 a3ff1000 24030000
/*
Copyright 2007, Traversal Technology
Copyright 2007, Petter Urkedal
DUAL LICENSING
(0) This "Work" is defined to be this document or source code, parts of
this document or source code, or derivative works of this document or
source code. Use of the Work, in whole or in part, must comply with
the licensing terms below.
(1) This Work is licensed under the GNU General Public License (GPL) as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version. You have the right to
use and modify this Work. If you distribute this work in "binary" form
(see (7)), you must publish your modified form of this Work in accordance
with the GPL license.
(2) This Work is also licensed as a proprietary work, all rights
belonging to Traversal Technology. Traversal Technology may use this
Work under those terms and has the right to publish, license, and sell
this Work and derivative works as they see fit. To remove these rights,
you must remove this clause.
(3) Use of this Work without clause (2) forfeits the right to use any
trademarks owned by Traversal Technology, the Open Graphics Project, or
related organizations. This is the only circumstance where modification
of this license by a third party is permitted.
(4) Patches, modifications, changes, and extensions (collectively,
"Modifications") to this Work that are submitted to the Open Graphics
Project, the Open Graphics Mailing List, directly to Traversal Technology,
or to an agent thereof must be SIGNED by the author of said Modification
(including the words "signed off by" and the author's name), granting
Traversal Technology copyright privileges under clause (2), as well as
clause (1). Unsigned Modifications will be ignored. Your inclusion of
the signature implies that you have read and agreed with this license
and have verified that you are in compliance with applicable law.
(5) Modifications committed directly to an officially recognized source
code repository are signed implicitly. Those who have write access to
such a repository and who commit Modifications to that repository grant
rights to Traversal Technology under clause (2), as well as clause (1),
by virtue of having write access and choosing to submit Modifications.
(6) It is the responsibility of the submitter of a Modification to ensure
that they have the right to submit the Modification and that they have
all the necessary permissions (including without limitation, patents
and copyrights) from any other contributors or third parties.
(7) An implementation of this Work that is considered analogous to a
"binary distribution" is defined as any form that is not easily
readable by humans ("non-preferred"), which includes, but is not
limited to: Fixed-function IC (e.g. ASIC), fixed-function IC masks
*/
// Instruction Format
//
`define QMODE_BITS 31:30
`define QOP_BITS 29:27
`define QIMM_BIT 26
`define IZ_BITS 25:21
`define IX_BITS 20:16
`define IY_BITS 15:11
`define IMM_BITS 15:0
`define IMM_SIGN_BIT 15
// Instruction Format: Major Operating Mode
//
`define QMODE_ARITH 0
`define QMODE_FETCH 1
`define QMODE_STORE 2
`define QMODE_BRANCH 3
`define QMODE_STORE_OR_BRANCH_BIT 31
`define QMODE_FETCH_OR_BRANCH_BIT 30
// Instruction Format: ALU Operations
`define QOP_AND 0
`define QOP_OR 1
`define QOP_XOR 2 // XOR with -1 for NOT.
`define QOP_SHL 3
`define QOP_ADD 4
`define QOP_RSUB 5 // Reversed args for maximum flexibility.
`define QOP_MULT 6
// Instruction Format: Branch Condition
//
// These overlaps with QOP_BITS which is OK since ALU is not used when
// branching.
//
`define BNOT_BIT 29 // negates branch condition
`define BZERO_BIT 28 // branch if zero (can combine with BNEG_BIT)
`define BNEG_BIT 27 // branch if negative
//
// The branch conditions can be expanded to:
//
// 3'b000 noop 3'b100 branch always (jump)
// 3'b001 branch if negative 3'b101 branch if non-negative
// 3'b010 branch if zero 3'b110 branch if non-zero
// 3'b011 branch if non-positive 3'b111 branch if positive
`define IREG_BITBUCKET 15
`define START_ADDR 0
`define SAFE_INSN 32'h00000000
// The Top-Level CPU Module
// ========================
//
module oga1cp(clock, reset_, clock_2x, phase,
do_upload, upload_addr, upload_data);
//> Standard signals.
input clock;
input reset_;
//> Clock running at twice the frequency of a[clock], with rising edge aligned
// with rising edge of a[clock].
input clock_2x;
//> A register which must be asserted on rising a[clock] and deasserted on
// falling a[clock]. That is, it's signal shape is mostly identical to
// a[clock], except for the small detail of being registered.
input phase;
//> On each rising clock edge where a[do_upload] is high, the value
// a[upload_data] is written to a[upload_addr]. The CPU will start running
// from address 0 on the first clock cycle after a[do_upload] goes low.
input do_upload;
input[8:0] upload_addr;
input[31:0] upload_data;
// -- end interface --
wire[31:0] s1_insn, s2_insn, s3_insn;
wire[8:0] s1_pc;
wire[31:0] s2_x;
wire[31:0] s2_y, s2_store_val;
wire[31:0] s3_store_val;
wire[4:0] s4_wb_reg;
wire[31:0] s3_z, s4_wb_val;
wire s4_wb_enable;
oga1cp_stg1_fetch stg1(
clock, reset_,
do_upload, upload_addr, upload_data,
s1_insn, s1_pc,
s2_insn[`BNOT_BIT], s2_insn[`BZERO_BIT], s2_insn[`BNEG_BIT], s2_x, s2_y);
oga1cp_stg2_regio stg2(
.clock(clock), .reset_(reset_), .clock_2x(clock_2x), .phase(phase),
.insn(s1_insn), .pc(s1_pc),
.insn_o(s2_insn), .x_o(s2_x), .y_o(s2_y), .yz_o(s2_store_val),
.wb_enable(s4_wb_enable), .wb_reg(s4_wb_reg), .wb_val(s4_wb_val));
oga1cp_stg3_alu stg3(
clock, reset_,
s2_insn, s2_x, s2_y, s2_store_val,
s3_insn, s3_z, s3_store_val);
oga1cp_stg4_memio stg4(
clock, reset_,
s3_insn, s3_z, s3_store_val,
s4_wb_enable, s4_wb_reg, s4_wb_val);
endmodule
// Stage 1: Instruction Fetch
// ==========================
//
module oga1cp_stg1_fetch(clock, reset_, do_upload, upload_addr, upload_data,
insn_o, pc_o,
s2_bnot, s2_bzero, s2_bneg, s2_x, s2_y);
//> Standard signals.
input clock;
input reset_;
//> On rising edge where a[do_upload] is non-zero, a[upload_data] is
// stoned at a[upload_addr] in program memory.
input do_upload;
input[8:0] upload_addr;
input[31:0] upload_data;
//> The current instruction.
output[31:0] insn_o;
//> The address just after the current instruction.
output[8:0] pc_o;
//> On rising edge when the instruction at stage 2 is a branch, a
// conditional branch is done. If a[s2_bnot] is 0, the condition
// is that a[s2_bzero] is 1 and a[s2_x] is zero or a[s2_bneg] is 1 and
// a[s2_x] is negative. If a[s2_bnot] is 1, the condition is negated.
input s2_bnot;
input s2_bzero;
input s2_bneg;
//> Test register for branch.
input[31:0] s2_x;
//> The target address for branch.
input[31:0] s2_y;
// -- end interface --
reg s2_insn_is_branch;
wire is_zero = s2_x == 0;
wire is_neg = s2_x[31];
wire do_branch = s2_insn_is_branch
&& s2_bnot !== (is_zero == s2_bzero && is_neg == s2_bneg);
reg [8:0] pc_o;
wire [8:0] next_pc = do_branch ? s2_y : pc_o;
always @(posedge clock or negedge reset_)
if (!reset_ || do_upload) begin
pc_o <= `START_ADDR - 1;
s2_insn_is_branch <= 0;
end else begin
pc_o <= next_pc + 1;
s2_insn_is_branch <= insn_o[`QMODE_BITS] == `QMODE_BRANCH;
end
RAMB16_S36_S36 program_memory (
.CLKA(clock), .SSRA(1'b0),
.ADDRA(next_pc),
.ENA(1'b1), .DOA(insn_o), .DOPA(),
.WEA(1'b0), .DIA(32'b0), .DIPA(4'b0),
.CLKB(clock), .SSRB(1'b0),
.ADDRB(upload_addr),
.ENB(1'b1), .DOB(), .DOPB(),
.WEB(do_upload), .DIB(upload_data), .DIPB(4'b0));
endmodule
// Stages 2, 5: Register Fetch and Write-Back
// ==========================================
//
module oga1cp_stg2_regio(clock, reset_, clock_2x, phase,
insn, pc,
insn_o, x_o, y_o, yz_o,
wb_enable, wb_reg, wb_val);
//> Standard signals.
input clock;
input reset_;
//> See the corresponding signals in ref[oga1cp].
input clock_2x;
input phase;
//> The instruction from stage 1.
input[31:0] insn;
//> The program counter just after a[insn].
input[8:0] pc;
//> The a[insn] delayed by 1 clock cycle.
output[31:0] insn_o;
//> The register value of the first operand according to a[insn_o].
output[31:0] x_o;
//> The second operand according to a[insn_o]. If the c[`QIMM_BIT] is set,
// this is the immedite value c[insn[`IMM_BITS]], otherwise it's a register.
output[31:0] y_o;
//> Register y if y is not immediate, otherwise register z. This is passed
// unchanged to stage 4, where it is used as the output value for store
// operations. Since store operations are always immediate, this is in
// pracise always register z.
output[31:0] yz_o;
//> On a falling clock cycle when a[wb_enable] is asserted, a[wb_val] is
// stored to register a[wb_val].
input wb_enable;
input[4:0] wb_reg;
input[31:0] wb_val;
// -- end interface --
reg[31:0] x_o, yz_o;
reg[31:0] insn_o;
wire[31:0] y_out_if_imm;
wire[4:0] ix = insn[`IX_BITS];
wire[4:0] iyz = insn[`QIMM_BIT]? insn[`IZ_BITS] : insn[`IY_BITS];
always @(posedge clock or negedge reset_)
if (!reset_) insn_o <= `SAFE_INSN;
else insn_o <= insn;
reg[31:0] regfile[0:31];
always @(posedge clock_2x)
if (phase == 1) begin // falling edge of clock (right?)
if (wb_enable)
regfile[wb_reg] <= wb_val;
end else begin
if (insn[`QMODE_BITS] == `QMODE_BRANCH)
x_o <= pc + 1;
else
x_o <= regfile[ix];
yz_o <= regfile[iyz];
end
assign y_out_if_imm = {{16{insn_o[`IMM_SIGN_BIT]}}, insn_o[`IMM_BITS]};
assign y_o = insn_o[`QIMM_BIT]? y_out_if_imm : yz_o;
endmodule
// Stage 3: ALU
// ============
//
module oga1cp_stg3_alu(clock, reset_, insn, x, y, yz, insn_o, res_o, yz_o);
//> Standard signals.
input clock;
input reset_;
//> The instruction from stage 2.
input[31:0] insn;
//> The operands.
input[31:0] x;
input[31:0] y;
//> Value to be passed though. It is used by store.
input[31:0] yz;
//> The one-cycle delayd a[insn].
output[31:0] insn_o;
//> The computed value, except: For branching, the value a[x] is passed
// though unchanged. This is the program counter.
output[31:0] res_o;
//> Passed though a[yz], delayed one clock cycle.
output[31:0] yz_o;
// -- end interface --
reg[31:0] res_o, yz_o;
reg[31:0] insn_o;
always @(posedge clock or negedge reset_)
if (!reset_)
insn_o <= `SAFE_INSN;
else begin
yz_o <= yz;
insn_o <= insn;
if (insn[`QMODE_BITS] == `QMODE_BRANCH) begin
res_o <= x;
end else case (insn[`QOP_BITS])
`QOP_AND: res_o <= x & y;
`QOP_OR: res_o <= x | y;
`QOP_XOR: res_o <= x ^ y;
`QOP_SHL: res_o <= x << y;// FIXME. Can we have signed y?
`QOP_ADD: res_o <= y + x; // FIXME. Explicit instatiate ADDSUB
`QOP_RSUB: res_o <= y - x; // ...
`QOP_MULT: res_o <= x * y; // FIXME. Explicit instatiate.
endcase
end
endmodule
// Stage 4: The Memory Access Stage
// ================================
//
module oga1cp_stg4_memio(clock, reset_, insn, val_or_addr, store_val,
wb_enable_o, wb_reg_o, wb_val_o);
//> Standard signals.
input clock;
input reset_;
//> Instruction from stage 3.
input[31:0] insn;
//> The result of stage 3, to be used as address for fetch/store or else
// passed though to the write-back.
input[31:0] val_or_addr;
//> For store, the value written to a[val_or_addr]. Not used for other modes.
input[31:0] store_val;
//> Control of register write-back.
output wb_enable_o;
output[4:0] wb_reg_o;
output[31:0] wb_val_o;
// -- end interface --
wire is_store = insn[`QMODE_BITS] == `QMODE_STORE;
reg is_fetch;
reg wb_enable_o;
reg[31:0] wb_val_if_not_fetch;
wire[31:0] wb_val_if_fetch;
reg[4:0] wb_reg_o;
always @(posedge clock) begin
is_fetch <= insn[`QMODE_BITS] == `QMODE_FETCH;
wb_val_if_not_fetch <= val_or_addr;
wb_enable_o <= !is_store;
wb_reg_o <= insn[`IZ_BITS];
end
assign wb_val_o = is_fetch? wb_val_if_fetch : wb_val_if_not_fetch;
RAMB16_S36_S36 local_memory(
.CLKA(clock), .SSRA(1'b0),
.ADDRA(val_or_addr[8:0]),
.ENA(1'b1), .DOA(wb_val_if_fetch), .DOPA(),
.WEA(is_store), .DIA(store_val), .DIPA(4'b0),
.CLKB(1'b0), .SSRB(1'b0),
.ADDRB(9'b0),
.ENB(1'b0), .DOB(), .DOPB(),
.WEB(1'b0), .DIB(32'b0), .DIPB(4'b0));
endmodule
oga1cp-asm.tar.bz2
Description: BZip2 compressed data
_______________________________________________ Open-graphics mailing list [email protected] http://lists.duskglow.com/mailman/listinfo/open-graphics List service provided by Duskglow Consulting, LLC (www.duskglow.com)
