Re: [Open-graphics] The Central Processor

Timothy Normand Miller Wed, 28 Mar 2007 18:16:45 -0800

On 3/28/07, Petter Urkedal <[EMAIL PROTECTED]> wrote:

On 2007-03-28, Timothy Normand Miller wrote:
> >    `define QOP_SHL   3 // Left shift for rY > 0, right shift for rY < 0
>
> Typically we have separate instructions for left-shift,
> right-shift-logical (unsigned), and right-shift-arithmetic (signed).


But do we need unsigned arithmetic?  I can't think of any qualitative
gain from providing unsigned, and the quantitative gain is just one bit
of the width out of 32.  We don't have more than 2 GB of memory, so it's
not needed for addressing.  The only thing I can think of is if the CPU
has to multiply or right-shift unsigned numbers.  What kind of data would
require the full 32 bits, and at the same time use multiply or
right-shift?

If we need unsigned, I think we'll reduce immediates from 16 to 15 bits
to make room for the three new instructions (unsigned multiply, left and
right shift).


I'm having trouble thinking of a lot of reasons why we need both, but
at the very least, you'll need it for multi-precision computations.
Admittedly, MIPS isn't great for that, lacking carries, but the code
that you have to do instead isn't all that bad.

Yes, it's probably a bit less logic if we avoid doing the two's-complement
negation.  On the other hand, we lose a bit of functionality when the
right hand side is a signed register, though it can still be written


Now that you mention it, simple negation is rather expensive.  Less so
than an add or subtract, but not that much less.  You for sure would
not want to feed the result of that into a barrel shifter.  Too much
delay.

However, consider this:

input [31:0] x;
input [5:0] shift;
output [31:0] z;
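wire [63:0] y = {x, 32'b0};         // x in the upper half, 32 zero bits below
wire [5:0] amount = shift + 6'd32;  // wraps modulo 64, i.e. flips shift[5]
assign z = y >> amount;             // z keeps the low 32 bits of the result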

The shift+32 is cheap, because all you're really doing is inverting
the high bit of the shift amount.  The only question is how much more
expensive is a 64-bit shifter than a 32-bit shifter.
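
To convince myself of the "invert the high bit" claim, here is a rough
Python model of that shifter.  It assumes the shift amount is a 6-bit
two's-complement value (-32..31); the function names are made up for the
illustration and nothing here is a real design.

```python
MASK32 = 0xFFFFFFFF

def shift_amount(shift):
    """Add 32 modulo 64 to a 6-bit field by flipping its top bit."""
    return (shift & 0x3F) ^ 0x20

def bidir_shift(x, shift):
    """Model of the 64-bit shifter; shift is assumed to be in -32..31."""
    y = (x & MASK32) << 32          # 64-bit value with x in the upper half
    amount = shift_amount(shift)    # always in 0..63
    return (y >> amount) & MASK32   # keep the low 32 bits, like the z output
```

In this model a positive amount shifts right and a negative amount shifts
left, and no negation of the shift amount is ever computed.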

> Since any register can be specified as either operand, there's almost
> no benefit in having a way to subtract with the operand order
> reverse... just reverse them in the code.

Or let the assembler reverse them?


Yes, but Paul Brook pointed out a good reason: being able to compute
0-x as a way to do neg.  However, as a counter-argument to his
suggestion, we can always do r0-x for that special case.

or a "rsub".  There are also other derived instructions we can consider
like "move", "jump", "noop", "neg", ... just to make life easier when
programming it.


Yes, quite a number of MIPS mnemonics are aliases for something else.
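
As a sketch of what the assembler side could look like, here is a toy
alias table in Python.  The base mnemonics ("add", "sub"), the operand
order (dest, src1, src2), and the particular expansions are all my
assumptions for illustration; the only thing taken from the discussion is
that r0 reads as zero.

```python
# Hypothetical alias table: each pseudo-op expands to one base instruction.
ALIASES = {
    "move": lambda d, s: ("add", d, s, "r0"),    # d = s + 0
    "neg":  lambda d, s: ("sub", d, "r0", s),    # d = 0 - s
    "noop": lambda: ("add", "r0", "r0", "r0"),   # result thrown away in r0
}

def expand(mnemonic, *operands):
    """Return the base instruction for an alias, or the instruction unchanged."""
    if mnemonic in ALIASES:
        return ALIASES[mnemonic](*operands)
    return (mnemonic, *operands)
```

The hardware never sees the aliases, so they cost nothing in decode logic.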

> The built-in multipliers are 18*18 -> 36.  I think you need four of
> them to make 36*36->72.  They're signed at 36 bits, so for us to do
> signed or not, we just need to decide whether or not we replicate the
> high bits of each operand out to the full word length.
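
The sign-extension point can be checked with a small Python model.  This
is only a sketch: it treats the 36x36 signed multiply as a single
operation instead of splitting it into 18x18 pieces, and the function
names are invented for the example.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def extend_to_36(word32, signed):
    """Widen a 32-bit word to a signed 36-bit operand."""
    if signed:
        return to_signed(word32, 32)    # replicate bit 31 into the high bits
    return word32 & 0xFFFFFFFF          # zero-fill, so the value is non-negative

def mul32(a, b, signed):
    """64-bit product bit pattern from one signed multiply of widened operands."""
    product = extend_to_36(a, signed) * extend_to_36(b, signed)
    return product & 0xFFFFFFFFFFFFFFFF
```

The same signed multiplier gives both results; only the way the upper four
bits of each operand are filled differs.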

Hmmm.  Maybe we'll start thinking of code and see what we need?  If we


Yes.

Again, we should probably check what we'll need.


Yes.

In fact, we have the luxury of not adding the instruction until later
(as long as we reserve space for it).  When we determine the true
needs, we can add what we need to without having useless instructions
limiting our clock rate.


Yes, we can create a nicer common-case instruction set on top of these
low level instructions.


That's the beauty of orthogonality.  :)

> This one is frustrating to me.  We really don't want to have to
> multiplex between Y and Z when looking up the register value, just for
> this one class of instructions.  We'd prefer to always read X and Y
> and always write Z.  But you have the conflicting requirement of
> wanting to use only an immediate value for the offset.

I didn't like the muxing over {y, z} myself, but I couldn't find a
better solution without giving up the REG + CONST addressing mode or
making the Y register and the Y immediate non-overlapping.


Basically, this is just going to be an annoyance that doesn't go away.
However, what we can do is try it various ways and see which has a
higher clock rate.  :)

> However, I presume that the return address is forwarded through the
> ALU to WB so that it appears in the register file.  I guess we'll want
> to generate an instruction that splices in the return address and adds
> it to zero and is also a no-op for MEM.  We need to diagram this out
> so that we know when the return address is actually available to be
> used.

The current branching instruction will pass the program counter just
past the branching instruction into register z.  Because of the
one-cycle delay of the branching, the instruction just past the branch
will be executed again at return.  If we don't want that, we could add
another increment.


This is why I added 2 to the PC (well, next_pc or whatever it was),
and passed that down the pipeline.  This way, what's stored is the
branch target.
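
Here is a toy Python model of the difference between linking with pc + 1
and pc + 2 when there is one branch delay slot.  The instruction names
("call", "ret", "mark", "halt") and the program are invented for the
illustration; this is not our instruction set.

```python
PROGRAM = [
    ("call", 4),            # 0: call the routine at address 4
    ("mark", "slot"),       # 1: delay slot of the call
    ("mark", "after"),      # 2: where we would like to resume
    ("halt",),              # 3
    ("mark", "func"),       # 4: body of the routine
    ("ret",),               # 5: branch to the saved link address
    ("mark", "ret-slot"),   # 6: delay slot of the return
]

def run(program, link_offset):
    """Run with one delay slot; return the marks in execution order."""
    trace, pc, link, pending = [], 0, None, None
    for _ in range(100):                    # bound the number of steps
        op = program[pc]
        # A branch issued by the previous instruction takes effect only now.
        next_pc = pc + 1 if pending is None else pending
        pending = None
        if op[0] == "halt":
            break
        if op[0] == "mark":
            trace.append(op[1])
        elif op[0] == "call":
            link = pc + link_offset         # saved return address
            pending = op[1]
        elif op[0] == "ret":
            pending = link
        pc = next_pc
    return trace
```

With link_offset=1 the trace contains "slot" twice, because the delay-slot
instruction runs again on return; with link_offset=2 it runs only once.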

Let's say, however, that as part of the feedback from REG to FETCH, we
pass not just the register value that contains the branch target but
also an offset (imm only?   or selectably reg value?).  Then the
address we branch to is the sum of two numbers, so we can add in an
offset.  This way, the return address stored can be 1 less than where
we want to go, so the "RET" opcode actually includes an offset.

The problem is that we end up with the program file load address
being fed from not just MUXes but also an adder, which is too many
levels of logic.

The truth is that adding a constant of 2 is expensive but not so
horribly expensive that we shouldn't consider it.

More tradeoffs.  What else is CPU design?  :)


> Usually, the bit-bucket is r0 for MIPS.  Why not do the same?

Maybe?  What if we have other special registers, like a stack pointer,
do we continue allocating r1, r2 for these, or do we continue from r31?
Well, anything goes.


I have some vague recollection of MIPS actually having specialized
stack-handling instructions.  Maybe I'm thinking of something else.
There's actually a stack pointer that's in a special purpose register
outside of the main 32-entry register file.  I don't know if we care
to do that or not.  We're not multitasking, so the overhead of storing
it to do a context switch isn't a concern for us.

We should avoid special-purpose registers as much as possible.  I
don't even want special condition registers.  Right now, we have the
PC, main registers, and a memory-mapped I/O space.


This sounds like a good idea.  Before reading your post I wrote this
addition to the REG IO stages:

always @(posedge clock_2x)
    if (phase == 1) begin // falling edge of clock (right?)


Yeah.  The idea is to have phase match, for most of the clock cycle,
the value of clock_1x.  So if phase==1, then we're about to encounter
a falling edge of clock_1x.

        if (wb_enable)
            regfile[wb_reg] <= wb_val;
    end else begin
        if (insn[`QMODE_BITS] == `QMODE_BRANCH)
            x_o <= pc + 1;
        else if (ix == `BITBUCKET_REG)
            x_o <= 0;
        else
            x_o <= regfile[ix];
        yz_o <= regfile[iyz];
    end


This'll work, although it reads the register file asynchronously.
You'll get more performance out of making it synchronous, because the
register is "for free" in the slices that contain the memory.

I like your formatting, BTW, although I think the proportional font in
gmail is ruining it.  :)

What we should do, really, is instantiate a dual-port synchronous
memory and then cope with the limitations around it.

In our module, we use stylized code so that the synthesizer infers the
right thing.

always @(posedge clock) begin
   if (write_enable) begin
       memory[addr0] <= write_data;
   end else begin
       read_data0 <= memory[addr0];
       read_data1 <= memory[addr1];
   end
end

And other stuff in there.  And this may not be exactly right.  We may
find ourselves using some IP generator from Lattice or Xilinx to
produce the right thing.  The idea is to be able to do two sync reads
at the same time or one write.
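
To make the timing consequence concrete, here is a rough Python model of
that memory, with one method call standing in for one rising clock edge.
The class name and the depth of 32 are my assumptions; the port names
follow the Verilog above.

```python
class SyncRegFile:
    """Toy model: per clock edge, either one write or two registered reads."""

    def __init__(self, depth=32):
        self.memory = [0] * depth
        self.read_data0 = 0     # output registers, updated only on a clock edge
        self.read_data1 = 0

    def clock(self, addr0, addr1, write_enable=False, write_data=0):
        """Model one rising clock edge."""
        if write_enable:
            self.memory[addr0] = write_data     # outputs hold their old values
        else:
            self.read_data0 = self.memory[addr0]
            self.read_data1 = self.memory[addr1]
```

A value written on one edge can be read no earlier than the next edge,
and the read data is only visible after that edge, which is the limitation
the rest of the pipeline has to cope with.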

That is, the bitbucket is special only when used as an X operand, so it
can still be used for temporary values.  But, I think I've made the
mistake of putting a computation between regfile[ix] and its
registration.  If I've understood your previous comments, we should
instead mux after the lookup:

    always ...
        x_o_if_reg <= regfile[ix]
        x_o_if_pc <= pc + 1
    assign x_o = ... x_o_if_reg ... x_o_if_pc ... 0

Right?


Yes.


So, we'll probably be better off storing 0 in the bit-bucket.


R0 should be the bitbucket in all cases.  It's very important to the
simplicity of our instruction set to be able to always rely on R0
being zero on reads and being a place where we can throw away results.
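
One way to get that behavior, sketched in Python: suppress writes to R0
so that the stored value stays zero and the read path needs no special
case.  This is just one possible arrangement, and the class and method
names are made up for the example.

```python
class RegFile:
    """Toy register file in which R0 is the bitbucket."""

    def __init__(self, count=32):
        self.regs = [0] * count     # R0 starts at zero and is never changed

    def write(self, index, value):
        """Write a 32-bit value; writes to R0 are silently discarded."""
        if index != 0:
            self.regs[index] = value & 0xFFFFFFFF

    def read(self, index):
        """Plain lookup; R0 reads as zero because nothing ever overwrites it."""
        return self.regs[index]
```

The cost moves from a mux on the read side to a gate on the write enable.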

--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
