Re: [Open-graphics] The Central Processor

Timothy Normand Miller Sun, 01 Apr 2007 08:19:24 -0700

On 4/1/07, Petter Urkedal <[EMAIL PROTECTED]> wrote:

Thanks to Paul, Nicholas and Tim for the feedback.


I have a new version which I'd like to check into the repository.  The
register-fetch stage has changed the most, especially to handle
register-forwarding.
I'll mention some of the other changes below.  The new version is at
http://www.eideticdew.org/~urkedal/ogp/, but I'll remove the special
treatment of r31.


Please go ahead and check this in.  Do we need to discuss where to put
it and what to name it?

Also, please don't forget to put the full copyright and license notice
at the top of every file.  Traversal won't be able to use any code
lacking that in any commercial product.

> The shift+32 is cheap, because all you're really doing is inverting
> the high bit of the shift amount.  The only question is how much more
> expensive is a 64-bit shifter than a 32-bit shifter.

Taking that argument in another direction... Since adding constants to
the RHS of a shift operator is free, we can turn an expensive
two's complement into a cheap one's complement, so we have a cheap way
to get an almost correct shift operator:

    case ...
        `QOP_SHL:  res_o <= y[31]? x[31:1] >>> ~y[4:0] : x << y[4:0];

This is only almost correct, since I've taken Paul's suggestion of
ignoring oddities for long shifts, but it correctly handles signed RHS
registers.
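
To spell out why the one's-complement trick works: for a negative y the intended right-shift amount is -y = ~y + 1, and the extra 1 is supplied by dropping bit 0 of x first (the x[31:1] part). The following is a minimal Python model of the expression above, not the synthesized logic. It assumes a logical (zero-filling) right shift, which the thread leaves open, and the names qop_shl, reference_shift and MASK32 are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def qop_shl(x, y):
    """Model of: y[31] ? x[31:1] >>> ~y[4:0] : x << y[4:0] (logical shift assumed)."""
    x &= MASK32
    y &= MASK32
    amount = y & 0x1F                  # y[4:0]
    if y >> 31:                        # y[31] set: negative shift, so shift right
        # x[31:1] supplies one bit of shift, ~y[4:0] supplies the remaining -y - 1
        return (x >> 1) >> (~amount & 0x1F)
    return (x << amount) & MASK32

def reference_shift(x, y):
    """Straightforward signed-amount shift, used only as a comparison."""
    x &= MASK32
    return x >> -y if y < 0 else (x << y) & MASK32

# The two agree for every shift amount in -32..31.
for y in range(-32, 32):
    for x in (0x00000001, 0x80000000, 0xDEADBEEF):
        assert qop_shl(x, y) == reference_shift(x, y)
```

Outside the range -32..31 only the low five bits of y are used, which is the long-shift oddity mentioned above.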


That is very clever.


We can leave it open whether we want arithmetic shift, logical shift, or
both.  If we go with all-signed, then arithmetic shift is the most
logical choice.  On the other hand, shifts are often used to select
bitfields in a word, and logical shifts allow writing x >> SHIFT
instead of (x >> SHIFT) & MASK when selecting an unsigned uppermost
bitfield (though an arithmetic x >> SHIFT would be correct for a signed
uppermost bitfield).  See also below for my suggestion of a
byte-shuffling move instruction.


Here's the thing.  What we need to do is just pick a solution and go
with it.  We could perform experiments and find that one way is 5% or
even 50% better than another.  But we're talking about a small part of
the processor, and a small part of a rapid prototype at that.

What we need to do is make a prototype that functions and evaluate it
carefully so we know how to redesign it.  It's possible, for instance,
that we might not want a shifter at all.

Consider this:

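wire [63:0] product = x * y;           // full 64-bit product of the 32-bit operands
wire [31:0] result  = product[47:16];  // i.e. (x * y) >> 16, keeping the high bits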

What's this?  If x and y are 16.16 fixed-point numbers, then we can
use this to do non-int math.  And given the right operations (or a
sequence of them), we can perform all the right and left shifts we need.  I
mean, really, the numbers are only what you interpret them to be, and
the operation is just an operation.

Ok, there are probably HUGE holes in this idea, and it's probably a
case of me trying to be too clever.  But it's something to have in
mind when we move forward on the design.
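
As a rough illustration of the multiply-as-shifter idea, here is a small Python model. It assumes unsigned 32-bit operands and a full-width product before the shift. The helper names (fixmul, shl_by_mul, shr_by_mul) are hypothetical and not from the design.

```python
MASK32 = 0xFFFFFFFF
FRAC_BITS = 16  # 16.16 fixed point

def fixmul(x, y):
    """(x * y) >> 16 on unsigned 32-bit operands, keeping 32 result bits."""
    return (((x & MASK32) * (y & MASK32)) >> FRAC_BITS) & MASK32

def shl_by_mul(x, k):
    """Left shift by k (0..15): multiply by 2**k expressed in 16.16."""
    return fixmul(x, 1 << (FRAC_BITS + k))

def shr_by_mul(x, k):
    """Logical right shift by k (0..16): multiply by 2**-k expressed in 16.16."""
    return fixmul(x, 1 << (FRAC_BITS - k))

# 1.5 * 2.25 = 3.375 in 16.16
assert fixmul(0x18000, 0x24000) == 0x36000
assert shl_by_mul(0xDEADBEEF, 4) == (0xDEADBEEF << 4) & MASK32
assert shr_by_mul(0xDEADBEEF, 12) == 0xDEADBEEF >> 12
```

In this model a single multiply only covers left shifts of 0..15 and right shifts of 0..16, so longer shifts would need a sequence of operations. That may be one of the holes.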

What I've done here is that I've put the PC increment into the ALU,
since it needs to go through that anyway, but currently it is separate
from the ALU ADDSUB unit.  If some extra muxing here does not cause a
timing bottleneck, then it's easy to change the ALU to re-use the ADDSUB
unit for the PC increment.


Good idea.  We have to add something to the PC and write it to the
register, and it has to go through the ALU anyhow.  And a bit of MUXing is
probably cheaper and definitely faster than another adder.  (Well,
that depends, since our PC is only 9 to 11 bits, but whatever.)

We need to make this prototype so that it synthesizes.  Given the
static timing analysis, we may be surprised as to what the bottlenecks
are.

> R0 should be the bitbucket in all cases.  It's very important to the
> simplicity of our instruction set to be able to always rely on R0
> being zero on reads and being a place where we can throw away results.

If we add a move instruction, then we don't need to treat any register
specially from the CPU's point of view.  We can make the move instruction
a bit more powerful, as well:


Ok.  Normally, we'd use ADD or OR or something with zero, but if we
have other special things to do, we can have a specialized
instruction.  This may be necessary for shunts.

Since we may be working with pixels, I suppose it could be useful to
allow the move instruction to optionally shuffle around the bytes, and
optionally mask out all but the lower byte/word.  The move instruction
does not use the x-register bits of the instruction, so we have 5 free
bits.  What I currently have in the code is


The most important thing to be efficient with is data movement between
PCI and memory.  That's where specialized things for shunts come into
play.

The second most important thing is the sort of bit extraction
necessary to decode DMA command packets.  But there's something to say
about this.  The GPU may not be able to keep up with the packet rate
for some operations... although it may be too fast for others.  The
point is that there are cases where inefficiencies in the CPU won't
hurt us.

I was thinking about bit picking instructions, but then I realized
that we can just left-shift any bit we want to examine into the sign
bit of a temp register and then use branch-neg and branch-not-neg
instructions.
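
A tiny Python model of that bit test, as a sketch only: it assumes 32-bit registers, and the names bit_is_set and is_negative are invented here to stand in for the shift and the branch-neg condition.

```python
MASK32 = 0xFFFFFFFF

def is_negative(word):
    """Condition a branch-neg instruction would test: the sign bit (bit 31)."""
    return ((word >> 31) & 1) == 1

def bit_is_set(word, n):
    """Test bit n (0..31) by left-shifting it into the sign bit of a temp register."""
    temp = (word << (31 - n)) & MASK32
    return is_negative(temp)

assert bit_is_set(0b1000, 3)
assert not bit_is_set(0b1000, 2)
```

So examining one bit costs a shift plus a conditional branch, with no dedicated bit-pick instruction.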

As for processing pixels, we should put absolutely no effort into
making VGA emulation efficient.  Seriously.  We need to emulate it,
but not quickly or well.  For text modes, we'll spend our time scanning
graphics memory in the background, continuously retranslating the text
to pixels.  That can be 5 frames/sec, and no one will care.  For
graphics modes, we need to trap reads and writes, and that'll be very
slow, but it would be anyhow.  But put VGA at the bottom of your list of
priorities.

// Byte-shuffling is used only for move instructions.  It reuses the bits of
// the x-register to specify what to shuffle.
wire[4:0] shuffle = insn[`X_REG_BITS];
wire[31:0] yw =  // Bit 0: swap words
    shuffle[0]? {y[15:0], y[31:16]} : y;
wire[31:0] yb =  // Bit 1: swap bytes
    shuffle[1]? {shuffle[2] & yw[23:16], yw[31:24], yw[7:0], yw[15:8]} : yw;
wire[31:0] ys =  // Bits 2..3: Filter all but low byte/word:
    {{16{shuffle[3]}} & yb[31:16], {8{shuffle[2]}} & yb[15:8], yb[7:0]};


Interesting.  Specialized byte-handling instructions.  I like it.
Compare that to shift+and to be sure that it's really helpful.  But
it's a good idea.  Also, it works well for 32-bit pixels (which is
basically all we handle).  But keep in mind that VGA is bit-oriented.

As a side note, I plan to have a translation mechanism in the path
between PCI and graphics memory that will perform some kinds of byte
twiddling.  Besides handling endianness, I want to be able to convert
bytes or 16-bit pixels into 32-bit pixels on the way out.  This way,
we can make the host THINK that the GPU can handle 8-bit pixels for
some situations.

[... and, under case operator ...]

            `QOP_MOVE: res_o <= ys;     // move with optional byte-shuffle

The x-register bits (the shuffle field) are set to binary 01100 for plain
moves.
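
To make the shuffle encoding concrete, here is a hedged Python model of the move datapath quoted above. It is a sketch under one simplifying assumption: the byte swap is modeled as a plain swap within each 16-bit word, leaving out the extra `shuffle[2] &` term on the top byte. The function name shuffle_move is invented for illustration.

```python
MASK32 = 0xFFFFFFFF

def shuffle_move(y, shuffle):
    """Model of the byte-shuffling move; shuffle is the 5-bit x-register field."""
    y &= MASK32
    # Bit 0: swap the two 16-bit words.
    yw = ((y << 16) | (y >> 16)) & MASK32 if shuffle & 0b0001 else y
    # Bit 1: swap the bytes within each 16-bit word (assumed plain swap).
    if shuffle & 0b0010:
        yb = ((yw & 0x00FF00FF) << 8) | ((yw & 0xFF00FF00) >> 8)
    else:
        yb = yw
    # Bits 2..3: the low byte always passes; bit 2 keeps byte 1, bit 3 keeps the high word.
    keep = 0x000000FF
    if shuffle & 0b0100:
        keep |= 0x0000FF00
    if shuffle & 0b1000:
        keep |= 0xFFFF0000
    return yb & keep

assert shuffle_move(0x11223344, 0b01100) == 0x11223344  # plain move
assert shuffle_move(0x11223344, 0b01111) == 0x44332211  # full byte reversal
assert shuffle_move(0x11223344, 0b00000) == 0x00000044  # low byte only
```

With both swap bits set the move reverses all four bytes, which is the usual endianness swap.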



--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
