Here's a first crack at an important part of the rasterizer.  There
are something on the order of 20 different numbers that have to be
incremented every cycle.  We're rasterizing triangles here, so for
instance, we have X1, which is the left edge, and X2, which is the
right edge.  They also have increments dX1 and dX2 that are added to
X1 and X2 to advance their values from one scanline to the appropriate
values for the next.  Almost all of these are 32-bit floats.

Now, we don't want to be doing full FP adds every cycle.  That's too
expensive, and totally unnecessary.  Except in unusual circumstances,
the exponent does not change from one add to the next, and when it
does, it's by one.
So what we can do is pre-process the floats coming in from the host,
pre-aligning the base (X1) and increment (dX1dY) so they are
denormalized and have the same exponent.  This process can be fully
pipelined and be transparent to the host.  The preprocessor would hold
the original X1 and dX1dY values, and whenever either is updated, the
shifts are processed, and the aligned working values are forwarded to
the actual rasterizer.  The nice thing here is that this alignment
logic can be shared among all base/increment pairs, thereby cutting
out a lot of logic necessary to do normalized floats.  Yet we
sacrifice no precision.
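
To make the alignment step concrete, here is a minimal sketch of what
the preprocessor's shift logic might look like.  The module name
raster_align, the function shift_sticky, and the port names are all
hypothetical.  It is purely combinational (a real version would be
pipelined), it flushes denormal inputs to zero, and it ignores
Inf/NaN.  The mantissa layout assumed is hidden bit, 23 fraction
bits, and three guard bits, with the lowest guard bit sticky.

```verilog
// Hypothetical sketch: align a base (e.g. X1) and its increment
// (e.g. dX1dY) to a common exponent.  Not the project's actual logic.
module raster_align(
    input  [31:0] base,        // IEEE-754 single
    input  [31:0] delta,       // IEEE-754 single
    output [7:0]  exp_out,     // Shared exponent (the larger of the two)
    output        sign_a,
    output        sign_b,
    output [26:0] mantissa_a,  // 1 hidden + 23 fraction + 3 guard bits
    output [26:0] mantissa_b);

wire [7:0] exp_a = base[30:23];
wire [7:0] exp_b = delta[30:23];

// Assumption: denormals (exponent field of zero) are flushed to zero.
wire [26:0] man_a = (exp_a == 0) ? 27'd0 : {1'b1, base[22:0],  3'b000};
wire [26:0] man_b = (exp_b == 0) ? 27'd0 : {1'b1, delta[22:0], 3'b000};

wire       a_bigger = (exp_a >= exp_b);
wire [7:0] diff     = a_bigger ? (exp_a - exp_b) : (exp_b - exp_a);

// Shift right by n, ORing every bit shifted off into the lowest bit.
function [26:0] shift_sticky;
    input [26:0] m;
    input [7:0]  n;
    reg   [53:0] wide;
    begin
        if (n >= 27)
            shift_sticky = {26'd0, |m};
        else begin
            wide = {m, 27'd0} >> n;
            shift_sticky = {wide[53:28], wide[27] | (|wide[26:0])};
        end
    end
endfunction

assign exp_out    = a_bigger ? exp_a : exp_b;
assign sign_a     = base[31];
assign sign_b     = delta[31];
assign mantissa_a = a_bigger ? man_a : shift_sticky(man_a, diff);
assign mantissa_b = a_bigger ? shift_sticky(man_b, diff) : man_b;

endmodule
```

Only the operand with the smaller exponent is shifted, so one
instance of this could in principle be time-shared among all of the
base/increment pairs.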


Below is a first stab at the logic that would do this math.  It is
able to handle sign changes (which can only happen when the delta and
base have opposite signs).  It also will only shift right.

I'm pretty sure there's little point in shifting left.  The need for
it can arise when the base is large, and the delta is a small negative.
Where the left-shifting would come in handy would be when the Y
rasterizer forwards results to the X rasterizer.  But really, there
would also be a preprocessor for the X rasterizer that would align X1
with dX1dX.  It's pointless to shift left in the X rasterizer, because
we wouldn't want to try to recover any precision that was lost in the
delta when it was shifted right in preprocessing.

What follows is a first stab at this component.  Following that, more
discussion about the next version.





// This module takes two pre-shifted, denormalized, and aligned floating point
// numbers and produces a sum.  The sum may be shifted by one, and the
// increment may also get shifted.
// See http://www.cs.nmsu.edu/~pfeiffer/classes/473/notes/fp-extras.html
// for info on guard bits.  We expect three guard bits, the right-most of which
// is the inclusive OR of all bits shifted off the right.  This gives us
// 23+1+3=27 mantissa bits
// This block may shift right when the base gets too big, but it will never
// shift left, because there's no point.
// Every cycle, the outputs should be registered and fed back in as inputs.
module raster_sum(
    // In-coming addends.  'a' is the base that gets incremented.
    // 'b' is the delta.
    input [7:0] exp_in,     // Starting exponent (for both numbers)
    input sign_a_in,
    input sign_b_in,
    input [26:0] mantissa_a_in,
    input [26:0] mantissa_b_in,

    // Outputs
    // The sign of 'b' never changes, so we don't output that.  Anything else
    // can change.
    output reg [7:0] exp_out,
    output reg sign_a_out,
    output reg [26:0] mantissa_a_out,
    output reg [26:0] mantissa_b_out);

// 29-bit add/subtract.  The two extra high bits flag overflow and
// sign change.
wire [28:0] sum1;
addsub ad(.a({2'b0, mantissa_a_in}),
          .b({2'b0, mantissa_b_in}),
          .subtract(sign_a_in ^ sign_b_in),
          .c(sum1));


wire [27:0] inverse = -sum1[27:0];
wire [7:0] next_exp = exp_in + 1;

always @(*) begin
    case (sum1[28:27])
        0: begin  // No shift, no sign change
            exp_out = exp_in;
            sign_a_out = sign_a_in;
            mantissa_a_out = sum1[26:0];
            mantissa_b_out = mantissa_b_in;
        end
        1: begin  // Overflow, shift right
            exp_out = next_exp;
            sign_a_out = sign_a_in;
            mantissa_a_out = sum1[27:1];
            mantissa_b_out = {1'b0, mantissa_b_in[26:1]};
        end
        2: begin  // Sign change and overflow, shift right
            exp_out = next_exp;
            sign_a_out = !sign_a_in;
            mantissa_a_out = inverse[27:1];
            mantissa_b_out = {1'b0, mantissa_b_in[26:1]};
        end
        3: begin  // Sign change, no overflow
            exp_out = exp_in;
            sign_a_out = !sign_a_in;
            mantissa_a_out = inverse[26:0];
            mantissa_b_out = mantissa_b_in[26:0];
        end
    endcase
end

endmodule



This logic isn't going to be fast enough.  We'd be lucky if the 29-bit
addsub could hit 100MHz in the S3.  Normally, we'd pipeline this, but
we need to increment every cycle, and one stage of pipelining would
cause results to be produced once every other cycle.  However, there
are some tricks we can play.  We want this to USUALLY produce a result
every cycle, but it's okay for it to skip a cycle now and then, as
long as it's not too often.  The shifts are relatively cheap, but that
sign change isn't.  So I think the next step is to include the holding
registers in the module, and have the module produce a "valid" bit for
the output.  On those occasions when a shift or sign change has to
happen, a cycle is inserted where there's no valid output.  Then all we
have to worry about is synchronizing all 20 of these units, because
they'll go invalid at different times.  I have some ideas for that
too.
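
A rough sketch of how the "valid" idea might look, under some
assumptions of mine: the module name raster_accum and the load port
are hypothetical, the adder is written inline rather than using
addsub, and both the shift and the sign change are deferred to the
repair cycle (only the sign change may need to be).  Synchronizing
multiple units is not addressed, and I make no claim about what clock
rate this reaches.

```verilog
// Hypothetical sketch: accumulator with holding registers and a
// valid bit.  When the raw sum needs a shift or sign change, one
// cycle with valid=0 is inserted while the repair is applied.
module raster_accum(
    input             clk,
    input             load,          // Assumed: loads aligned values
    input      [7:0]  exp_in,
    input             sign_a_in,
    input             sign_b_in,
    input      [26:0] mantissa_a_in,
    input      [26:0] mantissa_b_in,
    output reg        valid,
    output reg [7:0]  exp,
    output reg        sign_a,
    output reg [26:0] mantissa_a);

reg        sign_b;
reg [26:0] mantissa_b;
reg [28:0] raw;      // Unrepaired sum, held for the repair cycle
reg        fixup;    // High during the repair cycle

wire [28:0] a    = {2'b0, mantissa_a};
wire [28:0] b    = {2'b0, mantissa_b};
wire [28:0] sum1 = (sign_a ^ sign_b) ? (a - b) : (a + b);
wire [27:0] inverse = -raw[27:0];

always @(posedge clk) begin
    if (load) begin
        exp <= exp_in;
        sign_a <= sign_a_in;
        sign_b <= sign_b_in;
        mantissa_a <= mantissa_a_in;
        mantissa_b <= mantissa_b_in;
        fixup <= 0;
        valid <= 1;
    end else if (fixup) begin
        // Repair cycle: same four cases as raster_sum, applied to raw.
        case (raw[28:27])
            2'd1: begin  // Overflow, shift right
                exp <= exp + 1;
                mantissa_a <= raw[27:1];
                mantissa_b <= {1'b0, mantissa_b[26:1]};
            end
            2'd2: begin  // Sign change and overflow
                exp <= exp + 1;
                sign_a <= !sign_a;
                mantissa_a <= inverse[27:1];
                mantissa_b <= {1'b0, mantissa_b[26:1]};
            end
            2'd3: begin  // Sign change, no overflow
                sign_a <= !sign_a;
                mantissa_a <= inverse[26:0];
            end
            default: ;   // 0 never reaches the repair cycle
        endcase
        fixup <= 0;
        valid <= 1;
    end else begin
        raw <= sum1;
        if (sum1[28:27] == 2'd0) begin
            mantissa_a <= sum1[26:0];   // Common case: result every cycle
            valid <= 1;
        end else begin
            fixup <= 1;                 // Skip a cycle, repair next edge
            valid <= 0;
        end
    end
end

endmodule
```

In the common case the only logic between registers is the adder
plus the test of the top two bits.  The negate and the muxes move
into the inserted cycle.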

There's another option worth discussing.  Let these sums take two
cycles.  What you need is the parameter (X1), the parameter advanced
by one step (X1+dX1dY), and two times the delta (2*dX1dY).  On
alternating cycles, you pass X1 and X1_next down the pipeline, thereby
producing the right sequence of outputs, also updating each of those
counters every other cycle.  In fact, that might be the better option.
 Discuss!
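
To give the discussion something to poke at, here is a minimal sketch
of the interleaved scheme.  To keep it short it uses plain 32-bit
two's-complement fixed point instead of the aligned float mantissas,
so sign and exponent handling are left out entirely.  The module name
raster_interleave and all port names are hypothetical, and x1 (that
is, X1+dX1dY) and delta2 (2*dX1dY) are assumed to be precomputed
upstream.  The adder is split into two 16-bit stages, and the two
counters take turns going through it.

```verilog
// Hypothetical sketch: one two-stage adder shared by two interleaved
// counters.  x_out steps through x0, x1, x0+delta2, x1+delta2, ...
// which is X1, X1+d, X1+2d, X1+3d, ... when x1 = x0+d, delta2 = 2d.
module raster_interleave(
    input             clk,
    input             load,
    input      [31:0] x0,       // X1
    input      [31:0] x1,       // X1 + dX1dY (precomputed)
    input      [31:0] delta2,   // 2 * dX1dY, held stable after load
    output reg [31:0] x_out);

// Stage 1 registers: low-half sum with carry, plus high-half operands.
reg [16:0] lo_sum;
reg [15:0] hi_a;
reg [15:0] hi_b;

always @(posedge clk) begin
    if (load) begin
        // x0 goes straight to the output.  x1 is parked in stage 1
        // with a zero addend so it emerges unchanged next cycle.
        x_out  <= x0;
        lo_sum <= {1'b0, x1[15:0]};
        hi_a   <= x1[31:16];
        hi_b   <= 16'd0;
    end else begin
        // Stage 1: add the low halves of the value now at the output.
        lo_sum <= {1'b0, x_out[15:0]} + {1'b0, delta2[15:0]};
        hi_a   <= x_out[31:16];
        hi_b   <= delta2[31:16];
        // Stage 2: finish the sum started on the previous cycle.
        x_out  <= {hi_a + hi_b + {15'd0, lo_sum[16]}, lo_sum[15:0]};
    end
end

endmodule
```

For example, loading x0=10, x1=13, delta2=6 gives 10, 13, 16, 19 on
successive cycles.  Each counter is only updated every other cycle,
but a new result still comes out every cycle.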


-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
