On 2007-03-28, Timothy Normand Miller wrote:
> > `define QOP_SHL 3 // Left shift for rY > 0, right shift for rY < 0
>
> Typically we have separate instructions for left-shift,
> right-shift-logical (unsigned), and right-shift-arithmetic (signed).
But do we need unsigned arithmetic? I can't think of any qualitative
gain from providing unsigned, and the quantitative gain is just one bit
of width out of 32. We don't have more than 2 GB of memory, so it's not
needed for addressing. The only case I can think of is the CPU having to
multiply or right-shift unsigned numbers. What kind of data would
require the full 32 bits, and at the same time use multiply or
right-shift?
If we need unsigned, I think we'll reduce immediates from 16 to 15 bits
to make room for the three new instructions (unsigned multiply, left and
right shift).
> You COULD get the direction out of the sign bit of the shift amount,
> but compilers aren't used to this, and depending on how many unused
> opcodes you have, it might require less logic to just have more than
> one shift operation than to have to look at another signals to
> determine direction. Plus, you can't distinguish between signed and
> unsigned shifts this way.
Yes, it's probably a bit less logic if we avoid doing the
two's-complement negation. On the other hand, we lose a bit of
functionality when the right-hand side is a signed register, though it
can still be written as
;; Compute r0 << r1 for signed r1.
        bneg r1, L0, BUCKET_REG  ; negative shift count: go right-shift
        noop                     ; branch delay slot
        btrue L1
        shl r0, r1, r0           ; delay slot: shifts left before the jump
L0:     sub ZERO_REG, r1, r1     ; negate the shift count
        shr r0, r1, r0
        sub ZERO_REG, r1, r1     ; restore r1
L1:
I actually already changed the code to
case (insn[`QOP_BITS])
...
`QOP_SHL: res_o <= y >= 0? x << y : x >> -y;
...
I don't know how expensive that is.
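In C terms, the semantics of that expression look like this. A toy model, not the synthesized logic; note that C's `>>` on a signed operand is an arithmetic shift on common compilers, which is exactly the signed/unsigned distinction discussed above.

```c
/* Software model of the proposed QOP_SHL semantics: the sign of the
 * shift amount selects the direction.  Function and variable names
 * are mine, not from the Verilog source. */
#include <stdint.h>

int32_t qop_shl(int32_t x, int32_t y)
{
    if (y >= 0)
        return x << y;    /* left shift for non-negative y */
    else
        return x >> -y;   /* arithmetic right shift for negative y
                           * (implementation-defined in C, but the
                           * common behavior on two's-complement) */
}
```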
>
> > `define QOP_ADD 4
> > `define QOP_RSUB 5 // Reversed args for maximum flexibility.
>
> Since any register can be specified as either operand, there's almost
> no benefit in having a way to subtract with the operand order
> reverse... just reverse them in the code.
Or let the assembler reverse them?
> (Except for one case
> involving an immediate.)
Yes, that was the reason I did it. I considered it a free feature that
we can compute (- REG + signed_const), and I planned to fix it up in the
assembler by providing a "sub" which may be replaced with either an "add"
or an "rsub". There are also other derived instructions we can consider,
like "move", "jump", "noop", "neg", ... just to make life easier when
programming it.
> We do need separate ADD and SUB opcodes.
> Also, making one an even number and the other odd helps, because we
> can route the low bit of the opcode into the addsub to set its
> direction.
Good idea, I'll do that.
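In software terms, the even/odd trick looks like this (a toy model with my own names, using the QOP_ADD = 4, QOP_RSUB = 5 values from the defines above):

```c
/* The low opcode bit distinguishes add (even) from rsub (odd), so in
 * hardware it can be routed straight into the addsub's direction
 * input with no decoding logic. */
#include <stdint.h>

#define QOP_ADD  4
#define QOP_RSUB 5

int32_t alu_addsub(int opcode, int32_t rx, int32_t rcy)
{
    int sub = opcode & 1;                 /* low bit = subtract */
    return sub ? (rcy - rx) : (rx + rcy); /* rsub computes rcY - rX */
}
```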
> The way we get an addsub inferred is to put it into a module by itself
> so that the heuristics in the synthesizer will recognize it correctly.
> Basically what we want is this:
>
> module my_addsub(
> input [31:0] A,
> input [31:0] B,
> output [31:0] C,
> input sub);
> assign C = sub ? (A-B) : (A+B);
> endmodule
>
> In my experience, both Lattice and Xilinx synthesizers have done the
> right thing with this.
Thanks.
> > `define QOP_MULT 6
> >
> >If we combine these with the QMODE_ARITH mode, then we get the instructions
> >
> > and rX, rcY, rZ ; rZ := rX & rcY
> > or rX, rcY, rZ
> > xor rX, rcY, rZ
> > lsh rX, rcY, rZ ; rZ := if rcY < 0 then rX >> -rcY else rX << rcY
> > add rX, rcY, rZ ; rZ := rX + rcY
> > rsub rX, rcY, rZ ; rZ := rcY - rX
> > mult rX, rcY, rZ ; rZ := rX * rcY ; Note: always signed!
>
> Now that you mention that, perhaps we'd like to have signed and
> unsigned multiply instructions.
>
> The built-in multipliers are 18*18 -> 36. I think you need four of
> them to make 36*36->72. They're signed at 36 bits, so for us to do
> signed or not, we just need to decide whether or not we replicate the
> high bits of each operand out to the full word length.
Hmmm. Maybe we should start thinking about code and see what we need?
If we chain the multiplies like this, isn't that a recipe for losing
clock speed? Maybe we can manage with 18*18 or 18*32 multiplies? The
most common case, I'd guess, is 32 bits * small constant. If 32*32
cases are rare, and we gain speed by reducing width, we may gain in
general by using more instructions here?
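For reference, the four-multiplier chaining looks like this in software. A sketch of mine, using 16-bit halves because they map cleanly onto C types (the FPGA blocks are 18x18, but the structure is the same); the signedness handling is exactly the replication of high bits described above, and only the high halves need it.

```c
/* 32x32 -> 64 multiply decomposed into four partial products.
 * a = ah*2^16 + al and b = bh*2^16 + bl, where the low halves are
 * unsigned and the high halves are sign-extended. */
#include <stdint.h>

int64_t mul32_model(int32_t a, int32_t b)
{
    uint32_t al = (uint32_t)a & 0xffffu;  /* low half, unsigned */
    uint32_t bl = (uint32_t)b & 0xffffu;
    int32_t  ah = a >> 16;                /* high half, sign-extended */
    int32_t  bh = b >> 16;

    int64_t ll = (int64_t)al * bl;        /* unsigned x unsigned */
    int64_t lh = (int64_t)al * bh;        /* unsigned x signed   */
    int64_t hl = (int64_t)ah * bl;        /* signed   x unsigned */
    int64_t hh = (int64_t)ah * bh;        /* signed   x signed   */

    return ll + ((lh + hl) << 16) + (hh << 32);
}
```

The middle partial products feed into the sum shifted by 16, which is where the chaining (and the potential clock-speed cost) comes from.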
> The other thing to be figured out is how to deal with the upper 32
> bits. If you multiply two 32-bit numbers, you get a 64-bit word.
> [...]
Your suggestions here sound fine to me; I don't have anything better.
Again, we should probably check what we'll need.
> >Memory operations are handled by Stage 4, and therefore have access to
> >the ALU result of Stage 3. We use the ALU result as the address of
> >memory operations. Thus, for each arithmetic instruction, there is one
> >fetch (QMODE_FETCH) and one store (QMODE_STORE) instruction. I'll use
> >"add" as an example, and its corresponding fetch variant we'll denote
> >by
> >
> > add rX, rcY fetch rZ
>
> MIPS does exactly this. But I like how you split it out, having the
> ALU op in a fixed field and the memory op in another fixed field.
> Very RISC-like thinking. Hypothetically, we could just as well do
> other ALU ops to generate an address, while I have the impression that
> the MIPS processor hard-codes the ADD operation in the ALU for address
> generation. We'll see if this becomes useful... we should present it
> to the programmer as the available set of addressing modes.
Yes, we can create a nicer common-case instruction set on top of these
low level instructions.
>
> > add rX, cY store rZ
> >
> >Again the instruction format is the same as for "add", except that
> >QMODE_BITS has the value QMODE_STORE. The meaning is that rZ is stored
> >to address rX + cY. There are, however, some things to note about this.
> >First, there are 3 inputs here, but the register-fetch can only fetch 2
> >registers. Therefore, the store instruction must always be immediate
> >(cY), so the address is only computed from one register. Second, this
> >is the only instruction where rZ is *input*, and write-back is disabled.
> >This logic you can find in Stages 2 and 5, in particular the
> >computation of iyz and yz_o.
>
> This one is frustrating to me. We really don't want to have to
> multiplex between Y and Z when looking up the register value, just for
> this one class of instructions. We'd prefer to always read X and Y
> and always write Z. But you have the conflicting requirement of
> wanting to use only an immediate value for the offset.
I didn't like the muxing over {y, z} myself, but I couldn't find a
better solution without giving up the REG + CONST addressing mode or
making the Y register and the Y immediate non-overlapping.
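Just to pin down the store's address computation under that constraint: the offset must come from the immediate field, so the address is rX plus the sign-extended immediate. A sketch of mine (the 16-bit immediate width is the one mentioned earlier in the thread; "sext16" is my name for the helper):

```c
/* Effective address of a store: one register plus a sign-extended
 * immediate offset.  The immediate field, not a second register,
 * supplies the offset, freeing the Z read port for the store data. */
#include <stdint.h>

int32_t sext16(uint16_t imm)
{
    return (int32_t)(int16_t)imm;   /* replicate bit 15 upward */
}

uint32_t store_addr(uint32_t rx, uint16_t cy)
{
    return rx + (uint32_t)sext16(cy);
}
```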
> Could we perhaps restrict ourselves to using bits 10:0 of the
> immediate as the offset in this case? This way, we read X and Y, add
> the offset to whichever is the address, and then write the data.
>
> The thing is, this is probably just pushing complexity off from one
> stage onto another.
Yes, I think we are just pushing the complexity around. The current MUX
is on the input register file index. I think your suggestion would move
the MUX to the y-output of the register-fetch.
> >=== Branching ===
> >
> >Finally, branches are different from the above, since they don't use the
> >ALU. The ALU is skipped here partly because it's very critical to load
> >the target address back into stage 1 (instruction fetch) as soon as
> >possible. That also means the ALU bits in the instruction word can be
> >used to specify the condition of the branch. The relevant bits of the
> >instruction word are
>
> However, I presume that the return address is forwarded through the
> ALU to WB so that it appears in the register file. I guess we'll want
> to generate an instruction that splices in the return address and adds
> it to zero and is also a no-op for MEM. We need to diagram this out
> so that we know when the return address is actually available to be
> used.
The current branching instruction will pass the program counter just
past the branching instruction into register z. Because of the
one-cycle branch delay, the instruction just past the branch will be
executed again on return. If we don't want that, we could add another
increment.
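The off-by-one can be stated as a tiny model (my own, not from the pipeline source): the branch at `pc` links `pc + 1`, but with one delay slot the instruction at `pc + 1` has already executed when the branch takes effect.

```c
/* Address the current scheme writes into the link register z. */
int link_addr(int pc)        { return pc + 1; }

/* First instruction after the branch that has NOT yet executed,
 * given a one-cycle branch delay (one delay slot). */
int first_unexecuted(int pc) { return pc + 2; }
```

Since `link_addr(pc)` is one less than `first_unexecuted(pc)`, a return to the linked address re-runs the delay-slot instruction; linking `pc + 2` instead (the extra increment) would skip it.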
> Usually, the bit-bucket is r0 for MIPS. Why not do the same?
Maybe? What if we have other special registers, like a stack pointer:
do we continue allocating r1, r2 for these, or do we continue down
from r31? Well, anything goes.
> BTW, I was thinking about how to make r0 always read zero in the FPGA.
> We'd really rather not MUX in a zero after the register file.
> There's too much MUXing already. Rather, we'd like the register file
> to actually contain a zero in r0. After some thought, I realized that
> we could design WB to write zero to r0 while the system reset is
> active and also to cancel any write to r0 the rest of the time. This
> way, the register simply contains a zero, so reading the bit-bucket is
> cheap. Only if we get hit by a cosmic ray do we have a problem.
This sounds like a good idea. Before reading your post, I wrote this
addition to the REG IO stages:
always @(posedge clock_2x)
  if (phase == 1) begin  // falling edge of clock (right?)
    if (wb_enable)
      regfile[wb_reg] <= wb_val;
  end else begin
    if (insn[`QMODE_BITS] == `QMODE_BRANCH)
      x_o <= pc + 1;
    else if (ix == `BITBUCKET_REG)
      x_o <= 0;
    else
      x_o <= regfile[ix];
    yz_o <= regfile[iyz];
  end
That is, the bitbucket is special only when used as an X operand, so it
can still be used for temporary values. But I think I've made the
mistake of putting a computation between regfile[ix] and its
registration. If I've understood your previous comments, we should
instead mux after the lookup:
always ...
  x_o_if_reg <= regfile[ix]
  x_o_if_pc <= pc + 1
assign x_o = ... x_o_if_reg ... x_o_if_pc ... 0
Right?
So, we'll probably be better off storing 0 in the bit-bucket.
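As a sanity check, your write-back rule can be modeled in software like this (my own names; the register count is a placeholder): write zero to r0 during reset, cancel any later write to r0, and let reads go straight to the register file with no extra mux on the read path.

```c
/* Model of r0 as a constant zero maintained by WB: seeded at reset,
 * protected from writes afterwards, so a plain register-file read of
 * r0 always returns 0. */
#include <stdint.h>

#define NREGS 32
static int32_t regfile[NREGS];

void writeback(int reset, int wb_enable, int wb_reg, int32_t wb_val)
{
    if (reset)
        regfile[0] = 0;                /* seed the constant zero */
    else if (wb_enable && wb_reg != 0)
        regfile[wb_reg] = wb_val;      /* cancel any write to r0 */
}
```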
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)