2009/5/5 Timothy Normand Miller <[email protected]>

> Here's a first crack at an important part of the rasterizer.  There
> are something on the order of 20 different numbers that have to be
> incremented every cycle.  We're rasterizing triangles here, so for
> instance, we have X1, which is the left edge, and X2, which is the
> right edge.  They also have increments dX1 and dX2 that are added to
> X1 and X2 to advance their values for one scanline to the appropriate
> values for the next.  Almost all of these are 32-bit floats.


Weren't we using 25-bit floats internally?


>
> Now, we don't want to be doing full FP adds every cycle.  That's too
> expensive, and totally unnecessary.  Except for unusual circumstances,
> the exponent would usually not change, and when it does, it's by one.
> So what we can do is pre-process the floats coming in from the host,
> pre-aligning the base (X1) and increment (dX1dY) so they are
> denormalized and have the same exponent.  This process can be fully
> pipelined and be transparent to the host.  The preprocessor would hold
> the original X1 and dX1dY values, and whenever either is updated, the
> shifts are processed, and the aligned working values are forwarded to
> the actual rasterizer.  The nice thing here is that this alignment
> logic can be shared among all base/increment pairs, thereby cutting
> out a lot of logic necessary to do normalized floats.  Yet we
> sacrifice no precision.


By doing the preprocessing first and keeping only a partial value of dX or
dY, pre-aligned in the mantissa, we lose precision, especially if we do a
lot of successive operations.
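
To make the concern concrete, here is a rough Python model of the
pre-alignment step. It is only my sketch: the function names, the integer
mantissa representation and the example values are my own assumptions, not
the actual preprocessor. It shifts the delta mantissa right to the base
exponent, folds the lost bits into a sticky bit, and reports the exact error
after a number of additions. I have not checked how this compares with doing
a full FP add on every step.

```python
from fractions import Fraction

MANT_BITS = 27  # 23 fraction + 1 hidden + 3 guard bits, as in the quoted module


def align_delta(delta_mant: int, delta_exp: int, base_exp: int) -> int:
    """Shift the delta mantissa right so it shares the base exponent.

    Any bits shifted off are ORed into the lowest kept bit (sticky bit).
    """
    shift = base_exp - delta_exp
    if shift < 0:
        raise ValueError("sketch assumes the delta exponent <= base exponent")
    if shift == 0:
        return delta_mant
    if shift >= MANT_BITS:
        return 1 if delta_mant else 0
    kept = delta_mant >> shift
    lost = delta_mant & ((1 << shift) - 1)
    return kept | (1 if lost else 0)


def accumulated_error(delta_mant: int, delta_exp: int, base_exp: int,
                      steps: int) -> Fraction:
    """Exact error after `steps` additions, in units of the aligned LSB."""
    shift = base_exp - delta_exp
    exact = Fraction(delta_mant, 1 << shift)
    aligned = align_delta(delta_mant, delta_exp, base_exp)
    return steps * (Fraction(aligned) - exact)


if __name__ == "__main__":
    # Hypothetical delta: hidden bit set plus a few low bits, 4 exponents
    # below the base, accumulated over 1024 scanlines.
    d = (1 << 26) + 0b101
    print(accumulated_error(d, 0, 4, 1024))
```

In this model the per-step error is constant, so the total grows linearly
with the number of scanlines.
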


>
>
>
> Below is a first stab at the logic that would do this math.  It is
> able to handle sign changes (can only happen when the delta and base
> are opposite signs).  It also will only shift right.
>
> I'm pretty sure there's little point in shifting left.  This can
> happen when the base is large, and the delta is a small negative.
> Where the left-shifting would come in handy would be when the Y
> rasterizer forwards results to the X rasterizer.  But really, there
> would also be a preprocessor for the X rasterizer that would align X1
> with dX1dX.  It's pointless to shift left in the X rasterizer, because
> we wouldn't want to try to recover any precision that was lost in the
> delta when it was shifted right in preprocessing.
>
> What follows is a first stab at this component.  Following that, more
> discussion about the next version.
>
>
>
>
>
> // This module takes two pre-shifted, denormalized, and aligned floating
> // point numbers and produces a sum.  The sum may be shifted by one, and the
> // increment may also get shifted.
> // See http://www.cs.nmsu.edu/~pfeiffer/classes/473/notes/fp-extras.html
> // for info on guard bits.  We expect three guard bits, the right-most of
> // which is the inclusive OR of all bits shifted off the right.  This gives
> // us 23+1+3=27 mantissa bits.
> // This block may shift right when the base gets too big, but it will never
> // shift left, because there's no point.
> // Every cycle, the outputs should be registered and fed back in as inputs.
> module raster_sum(
>    // In-coming addends.  'a' is the base that gets incremented.
>    // 'b' is the delta.
>    input [7:0] exp_in,     // Starting exponent (for both numbers)
>    input sign_a_in,
>    input sign_b_in,
>    input [26:0] mantissa_a_in,
>    input [26:0] mantissa_b_in,
>
>    // Outputs
>    // The sign of 'b' never changes, so we don't output that.  Anything
>    // else can change.
>    output reg [7:0] exp_out,
>    output reg sign_a_out,
>    output reg [26:0] mantissa_a_out,
>    output reg [26:0] mantissa_b_out);
>
> wire [28:0] sum1;
> addsub ad(.a({2'b0, mantissa_a_in}),
>          .b({2'b0, mantissa_b_in}),
>          .subtract(sign_a_in ^ sign_b_in),
>          .c(sum));


Should be ".c(sum1));"


>
>
>
> wire [27:0] inverse = -sum1[27:0];
> wire [7:0] next_exp = exp_in + 1;
>
> always @(*) begin
>    case (sum1[28:27])
>        0: begin  // No shift, no sign change
>            exp_out = exp_in;
>            sign_a_out = sign_a_in;
>            mantissa_a_out = sum1[26:0];
>            mantissa_b_out = mantissa_b_in;
>        end
>        1: begin  // Overflow, shift right
>            exp_out = next_exp;
>            sign_a_out = sign_a_in;
>            mantissa_a_out = sum1[27:1];
>            mantissa_b_out = mantissa_b_in[26:1];
>        end
>        2: begin  // Sign change and overflow, shift right
>            exp_out = next_exp;
>            sign_a_out = !sign_a_in;
>            mantissa_a_out = inverse[27:1];
>            mantissa_b_out = mantissa_b_in[26:1];
>        end
>        3: begin  // Sign change, no overflow
>            exp_out = exp_in;
>            sign_a_out = !sign_a_in;
>            mantissa_a_out = inverse[26:0];
>            mantissa_b_out = mantissa_b_in[26:0];
>        end
>    endcase
> end
>
> endmodule
>
>
>
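
To check my reading of the module above, here is a small Python model of one
evaluation. It is only a sketch under a big assumption: the addsub module is
not shown, so I assume it computes a+b or a-b with 29-bit wrap-around. The
function name and argument names mirror the Verilog ports.

```python
MASK27 = (1 << 27) - 1
MASK28 = (1 << 28) - 1
MASK29 = (1 << 29) - 1


def raster_sum(exp_in, sign_a_in, sign_b_in, mant_a_in, mant_b_in):
    """One combinational evaluation of the quoted raster_sum module.

    Returns (exp_out, sign_a_out, mantissa_a_out, mantissa_b_out).
    """
    subtract = sign_a_in ^ sign_b_in
    # ASSUMPTION: addsub is a plain 29-bit add/subtract that wraps around.
    if subtract:
        sum1 = (mant_a_in - mant_b_in) & MASK29
    else:
        sum1 = (mant_a_in + mant_b_in) & MASK29
    inverse = (-(sum1 & MASK28)) & MASK28    # wire [27:0] inverse = -sum1[27:0]
    next_exp = (exp_in + 1) & 0xFF           # 8-bit exponent increment
    top = sum1 >> 27                         # sum1[28:27]
    if top == 0:    # no shift, no sign change
        return exp_in, sign_a_in, sum1 & MASK27, mant_b_in
    if top == 1:    # overflow, shift right
        return next_exp, sign_a_in, (sum1 >> 1) & MASK27, mant_b_in >> 1
    if top == 2:    # sign change and overflow, shift right
        return next_exp, sign_a_in ^ 1, (inverse >> 1) & MASK27, mant_b_in >> 1
    # top == 3: sign change, no overflow
    return exp_in, sign_a_in ^ 1, inverse & MASK27, mant_b_in


if __name__ == "__main__":
    print(raster_sum(10, 0, 0, 5, 3))              # same signs, no overflow
    print(raster_sum(10, 0, 0, 1 << 26, 1 << 26))  # overflow, shift right
    print(raster_sum(10, 0, 1, 3, 5))              # delta larger, sign flips
```
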
> This logic isn't going to be fast enough.  We'd be lucky if the 29-bit
> addsub could hit 100MHz in the S3.  Normally, we'd pipeline this, but
> we need to increment every cycle, and one stage of pipelining would
> cause results to be produced once every other cycle.  However, there
> are some tricks we can play.  We want this to USUALLY produce a result
> every cycle, but it's okay for it to skip a cycle now and then, as
> long as it's not too often.  The shifts are relatively cheap, but that
> sign change isn't.  So I think the next step is to include the holding
> registers in the module, and have the module produce a "valid" bit for
> the output.  On those occasions when a shift or sign change has to
> happen, a cycle is inserted where there's no valid output.  Then all we
> have to worry about is synchronizing all 20 of these units, because
> they'll go invalid at different times.  I have some ideas for that
> too.
>

To pipeline this, we could always interleave 2 or 3 raster sums on the same
unit. The control logic would cost a little more, but it would still be
relatively small, and if it helps with the speed we shouldn't worry too much
about it. As for the negated value, do we really need to care about that?
The color values can't go negative; for the rest of the data I don't know,
maybe the X/Y values?
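
Here is a rough Python sketch of the interleaving I have in mind. It is a
toy model, not a design: the function name is made up, plain integers stand
in for the aligned mantissas, and I assume an adder with as many pipeline
stages as there are accumulators sharing it.

```python
def interleaved_run(bases, deltas, cycles):
    """Share one N-stage pipelined adder among N independent accumulators.

    Each cycle one accumulator enters the pipeline (round-robin) and the
    addition issued N cycles earlier retires, so the adder is busy every
    cycle even though each accumulator only advances once every N cycles.
    """
    n = len(bases)
    values = list(bases)
    pipeline = [None] * n            # in-flight (index, result) pairs
    outputs = [[] for _ in range(n)]
    for cycle in range(cycles):
        done = pipeline.pop(0)       # result leaving the last stage
        if done is not None:
            idx, result = done
            values[idx] = result     # write back before the next issue
            outputs[idx].append(result)
        idx = cycle % n              # round-robin issue slot
        pipeline.append((idx, values[idx] + deltas[idx]))
    return outputs


if __name__ == "__main__":
    # Two hypothetical parameters sharing one 2-stage adder for 7 cycles.
    print(interleaved_run([0, 100], [1, 10], 7))
```

Because an accumulator is only issued again in the same cycle its previous
result retires, no addition ever waits on a value still in the pipeline.
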


>
> There's another option worth discussing.  Let these sums take two
> cycles.  What you need is the parameter (X1), the parameter advanced
> by one step (X1+dX1dY), and two times the delta (2*dX1dY).  On
> alternating cycles, you pass X1 and X1_next down the pipeline, thereby
> producing the right sequence of outputs, also updating each of those
> counters every other cycle.  In fact, that might be the better option.
>  Discuss!
>

That's another possibility, but calculating the current and the next value
would require us to precompute 2*dX1dY. If we interleave instead, we could
interleave X1+dXdY with Y1+dXdY, or X1+dXdY with the other line's
X1+dXdY1.
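
For comparison, this is how I understand the two-cycle proposal, as a small
Python sketch. The names are mine and integers stand in for the aligned
values. For a normalized float, 2*dX1dY should be just an exponent
increment, barring overflow.

```python
def two_lane_sequence(x1, delta, count):
    """Emit x1, x1+delta, x1+2*delta, ... using two accumulators.

    Each lane is updated only every other cycle, by 2*delta, so its adder
    has two cycles to finish; the lanes alternate to give one value per cycle.
    """
    lanes = [x1, x1 + delta]   # X1 and X1_next, set up once beforehand
    step = 2 * delta           # precomputed 2*dX1dY
    out = []
    for cycle in range(count):
        lane = cycle & 1       # even cycles use lane 0, odd cycles lane 1
        out.append(lanes[lane])
        lanes[lane] += step    # result not needed again until cycle + 2
    return out


if __name__ == "__main__":
    print(two_lane_sequence(10, 3, 6))
```
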


>
> --
> Timothy Normand Miller
> http://www.cse.ohio-state.edu/~millerti
> Open Graphics Project
> _______________________________________________
> Open-graphics mailing list
> [email protected]
> http://lists.duskglow.com/mailman/listinfo/open-graphics
> List service provided by Duskglow Consulting, LLC (www.duskglow.com)
>