Here's a first crack at an important part of the rasterizer. There is something on the order of 20 different numbers that have to be incremented every cycle. We're rasterizing triangles here, so for instance, we have X1, which is the left edge, and X2, which is the right edge. They also have increments dX1 and dX2 that are added to X1 and X2 to advance their values from one scanline to the appropriate values for the next. Almost all of these are 32-bit floats.
Now, we don't want to be doing full FP adds every cycle. That's too expensive, and totally unnecessary. Except in unusual circumstances, the exponent does not change from one add to the next, and when it does, it changes by one. So what we can do is pre-process the floats coming in from the host, pre-aligning the base (X1) and increment (dX1dY) so they are denormalized and have the same exponent. This process can be fully pipelined and be transparent to the host. The preprocessor would hold the original X1 and dX1dY values, and whenever either is updated, the shifts are processed, and the aligned working values are forwarded to the actual rasterizer. The nice thing here is that this alignment logic can be shared among all base/increment pairs, thereby cutting out a lot of the logic necessary to do normalized floats. Yet we sacrifice no precision.

Below is a first stab at the logic that would do this math. It is able to handle sign changes (which can only happen when the delta and base have opposite signs). It also will only shift right. I'm pretty sure there's little point in shifting left. The need for a left shift arises when the base is large and the delta is a small negative, so the magnitude shrinks. Where the left-shifting would come in handy would be when the Y rasterizer forwards results to the X rasterizer. But really, there would also be a preprocessor for the X rasterizer that would align X1 with dX1dX. It's pointless to shift left in the X rasterizer, because we wouldn't want to try to recover any precision that was lost in the delta when it was shifted right in preprocessing.

What follows is a first stab at this component. Following that, more discussion about the next version.

// This module takes two pre-shifted, denormalized, and aligned floating point
// numbers and produces a sum.  The sum may be shifted by one, and the
// increment may also get shifted.
// See http://www.cs.nmsu.edu/~pfeiffer/classes/473/notes/fp-extras.html
// for info on guard bits.  We expect three guard bits, the right-most of which
// is the inclusive OR of all bits shifted off the right.  This gives us
// 23+1+3=27 mantissa bits
// This block may shift right when the base gets too big, but it will never
// shift left, because there's no point.
// Every cycle, the outputs should be registered and fed back in as inputs.

module raster_sum(
    // In-coming addends.  'a' is the base that gets incremented.
    // 'b' is the delta.
    input [7:0] exp_in,         // Starting exponent (for both numbers)
    input sign_a_in,
    input sign_b_in,
    input [26:0] mantissa_a_in,
    input [26:0] mantissa_b_in,

    // Outputs
    // The sign of 'b' never changes, so we don't output that.  Anything else
    // can change.
    output reg [7:0] exp_out,
    output reg sign_a_out,
    output reg [26:0] mantissa_a_out,
    output reg [26:0] mantissa_b_out);

// Two extra bits on top of the 27-bit mantissas: one for carry out,
// one for the sign of a subtraction.
wire [28:0] sum1;
addsub ad(.a({2'b0, mantissa_a_in}), .b({2'b0, mantissa_b_in}),
    .subtract(sign_a_in ^ sign_b_in), .c(sum1));

// Magnitude of the sum when it has gone negative.
wire [27:0] inverse = -sum1[27:0];
wire [7:0] next_exp = exp_in + 1;

always @(*) begin
    case (sum1[28:27])
        2'b00: begin
            // No shift, no sign change
            exp_out = exp_in;
            sign_a_out = sign_a_in;
            mantissa_a_out = sum1[26:0];
            mantissa_b_out = mantissa_b_in;
        end
        2'b01: begin
            // Overflow, shift right
            exp_out = next_exp;
            sign_a_out = sign_a_in;
            mantissa_a_out = sum1[27:1];
            mantissa_b_out = {1'b0, mantissa_b_in[26:1]};
        end
        2'b10: begin
            // Sign change and overflow, shift right
            exp_out = next_exp;
            sign_a_out = !sign_a_in;
            mantissa_a_out = inverse[27:1];
            mantissa_b_out = {1'b0, mantissa_b_in[26:1]};
        end
        2'b11: begin
            // Sign change, no overflow
            exp_out = exp_in;
            sign_a_out = !sign_a_in;
            mantissa_a_out = inverse[26:0];
            mantissa_b_out = mantissa_b_in[26:0];
        end
    endcase
end

endmodule

This logic isn't going to be fast enough. We'd be lucky if the 29-bit addsub could hit 100MHz in the S3. Normally, we'd pipeline this, but we need to increment every cycle, and one stage of pipelining would cause results to be produced once every other cycle. However, there are some tricks we can play.
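For anyone who wants to poke at the arithmetic before touching Verilog, here is a rough Python model of the pre-alignment step and of one raster_sum add. It is a sketch, not part of the design: the function names (unpack, align, step, run) are made up for illustration, the alignment uses the larger of the two exponents (an assumption, since the rule isn't spelled out above), and the overflow shift simply truncates the way the module does rather than OR-ing into the sticky bit.

```python
import struct

MANT_BITS = 27  # 23 fraction + 1 hidden + 3 guard bits


def unpack(x):
    """Split x (rounded to float32) into sign, biased exponent, 27-bit mantissa."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0:
        # Zero or subnormal: no hidden bit, exponent behaves as 1.
        return sign, 1, frac << 3
    return sign, exp, ((1 << 23) | frac) << 3


def shift_right_sticky(m, n):
    """Shift right by n, OR-ing all shifted-off bits into the LSB."""
    if n == 0:
        return m
    lost = m & ((1 << n) - 1)
    return (m >> n) | (1 if lost else 0)


def align(base, delta):
    """Denormalize base and delta so they share one exponent (assumed: the larger)."""
    sa, ea, ma = unpack(base)
    sb, eb, mb = unpack(delta)
    exp = max(ea, eb)
    return (exp, sa, shift_right_sticky(ma, exp - ea),
            sb, shift_right_sticky(mb, exp - eb))


def step(exp, sa, ma, sb, mb):
    """One add, mirroring raster_sum: returns new exp, sign_a, mantissa_a, mantissa_b."""
    s = ma + mb if sa == sb else ma - mb
    if s < 0:
        # Sign change: flip the sign of 'a' and keep the magnitude.
        return exp, sa ^ 1, -s, mb
    if s >> MANT_BITS:
        # Overflow: shift both mantissas right by one, bump the exponent.
        return exp + 1, sa, s >> 1, mb >> 1
    return exp, sa, s, mb


def to_float(exp, sign, mant):
    """Convert the working representation back to a Python float."""
    value = mant * 2.0 ** (exp - 127 - 26)
    return -value if sign else value


def run(base, delta, n):
    """Align once, then apply n increments, as the rasterizer would."""
    exp, sa, ma, sb, mb = align(base, delta)
    for _ in range(n):
        exp, sa, ma, mb = step(exp, sa, ma, sb, mb)
    return to_float(exp, sa, ma)
```

In this model, run(127.5, 1.0, 1) crosses a power of two, so it takes the overflow branch and returns 128.5, while run(1.0, -0.75, 2) takes the sign-change branch and returns -0.5.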
We want this to USUALLY produce a result every cycle, but it's okay for it to skip a cycle now and then, as long as it's not too often. The shifts are relatively cheap, but that sign change isn't. So I think the next step is to include the holding registers in the module, and have the module produce a "valid" bit for the output. On those occasions when a shift or sign change has to happen, a cycle is inserted where there's no valid output. Then all we have to worry about is synchronizing all 20 of these units, because they'll go invalid at different times. I have some ideas for that too.

There's another option worth discussing. Let these sums take two cycles. What you need is the parameter (X1), the parameter advanced by one step (X1_next = X1+dX1dY), and two times the delta (2*dX1dY). On alternating cycles, you pass X1 and X1_next down the pipeline, thereby producing the right sequence of outputs, and each of those counters is updated by 2*dX1dY every other cycle. In fact, that might be the better option.

Discuss!

-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
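The two-cycle option above can be modelled in a few lines of Python. This is only a sketch with hypothetical names (interleaved, lanes), and it uses plain integers in place of the aligned mantissas: two accumulators start at X1 and X1+dX1dY, each is advanced by 2*dX1dY, and they take turns supplying the output.

```python
def interleaved(x1, d, count):
    """Return x1, x1+d, x1+2d, ... using two accumulators stepped by 2*d."""
    lanes = [x1, x1 + d]   # X1 and X1_next, both set up ahead of time
    step2 = 2 * d          # each lane advances by two steps at once
    out = []
    for cycle in range(count):
        lane = cycle & 1           # even cycles use lane 0, odd cycles lane 1
        out.append(lanes[lane])    # emit this lane's current value
        lanes[lane] += step2       # result is not needed until two cycles later
    return out
```

Because each lane's updated value is only consumed two cycles after it is emitted, the add has two cycles to complete while the output still advances every cycle. For example, interleaved(10, 3, 6) gives [10, 13, 16, 19, 22, 25], the same sequence as a single accumulator stepped by 3.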
