Another one in the series of architectural / performance rants.

Last week I committed some basic Verilog skeleton files which mark
the beginning of the rasterization module of the 3D engine. However,
one problem struck me while implementing the logic itself, namely how
to handle pipeline stalls. A good example is the horizontal
rasterizer[1], where you can see that I have divided it up into 3
basic stages, corresponding to the calculations found in the
new_model code. Basically:

        1) adjustment = some math
        2) initial point = more math * adjustment
        3) for each step in width:
                calculate values for step.

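The three stages above might look roughly like the skeleton below. This is only an illustrative sketch, not the actual rastHori code: the module name, signal names, data width and the span width of 16 are all my own inventions, and the real math is replaced by placeholder assignments. It just shows the structural problem: stages 1 and 2 are plain pipeline registers, while stage 3 keeps a counter and holds the front end while it loops.

```verilog
// Hypothetical skeleton of the three-stage rasterizer pipeline.
// All names and widths are invented for illustration.
module rast_hori_sketch (
    input  wire        clk,
    input  wire        rst,
    input  wire        in_valid,
    input  wire [31:0] in_data,
    output wire        in_ready,   // deasserted while stage 3 loops
    output reg         out_valid,
    output reg  [31:0] out_data
);
    // Stage 1: adjustment = some math (fixed latency)
    reg        s1_valid;
    reg [31:0] s1_adjust;
    // Stage 2: initial point = more math * adjustment (fixed latency)
    reg        s2_valid;
    reg [31:0] s2_point;
    // Stage 3: iterate over the span width; busy until the loop is done
    reg [7:0]  s3_count;
    wire       s3_busy = (s3_count != 0);

    assign in_ready = !s3_busy;   // stall the whole front end during the loop

    always @(posedge clk) begin
        if (rst) begin
            s1_valid <= 1'b0; s2_valid <= 1'b0;
            s3_count <= 8'd0; out_valid <= 1'b0;
        end else if (!s3_busy) begin
            out_valid <= 1'b0;
            s1_valid  <= in_valid;
            s1_adjust <= in_data;        // placeholder for the stage 1 math
            s2_valid  <= s1_valid;
            s2_point  <= s1_adjust;      // placeholder for the stage 2 math
            if (s2_valid)
                s3_count <= 8'd16;       // assumed span width
        end else begin
            s3_count  <= s3_count - 8'd1;
            out_valid <= 1'b1;
            out_data  <= s2_point;       // per-step value would go here
        end
    end
endmodule
```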
Now, the first two stages are easy: the cycle count from the entry of
data into stage 1 until it is ready for output from stage 2 is fixed,
and depends mainly on the latency introduced by the floating-point
operations involved. The problem comes with stage 3. Since stage 3
logically involves a loop over the width, it might have to stall the
pipeline while waiting for the loop to finish processing. Thus the
question arises: what are we going to do with the data already
introduced into the pipeline?

The naive solution to the problem/challenge is to introduce a queue at
the end of each stage. The depth of the queue then has to be at least
equal to the number of cycles used by the stage itself. This covers
the case where we have filled the entire pipeline of the floating-point
module and then encounter a pipeline stall: we have to be able to
store all of the output generated by the floating-point modules, since
the FP modules have no mechanism for stalling themselves. The stage
then has to stall, unable to accept any new data or process anything
until the stall ends.

My major problem with this solution is that, as far as I can see, a
stall will introduce latency in the form of a startup cost. If the
queues are full and we encounter a stall from the next step in the
pipeline, we cannot accept new incoming data either, since we cannot
guarantee that there will be space in the outgoing queue for the
results. Hence we get a startup cost equal to the number of cycles
data takes to traverse the pipeline step.

Anybody have some good solutions? The easy answer is to say that it
doesn't cost that much compared to feature Y, but that feels a bit
like cheating. Another solution is to incorporate a
"store/do not continue" flag in every pipeline step and store the
output in a register local to the step.
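The "store/do not continue" flag amounts to what is often called a one-entry skid buffer with a valid/ready handshake per stage; something along these lines (names and widths are my own, not from the engine):

```verilog
// One-entry skid buffer: when the downstream stalls, the in-flight
// word is parked in a local register ("store" flag set) and in_ready
// drops, so the stage loses no data and needs no deep FIFO.
module skid_buffer #(parameter WIDTH = 32) (
    input  wire             clk,
    input  wire             rst,
    input  wire             in_valid,
    input  wire [WIDTH-1:0] in_data,
    output wire             in_ready,
    output wire             out_valid,
    output wire [WIDTH-1:0] out_data,
    input  wire             out_ready
);
    reg             skid_valid;       // the "store/do not continue" flag
    reg [WIDTH-1:0] skid_data;

    assign in_ready  = !skid_valid;   // can accept unless holding a word
    assign out_valid = skid_valid ? 1'b1 : in_valid;
    assign out_data  = skid_valid ? skid_data : in_data;

    always @(posedge clk) begin
        if (rst)
            skid_valid <= 1'b0;
        else if (skid_valid && out_ready)
            skid_valid <= 1'b0;       // parked word drained downstream
        else if (!skid_valid && in_valid && !out_ready) begin
            skid_valid <= 1'b1;       // downstream stalled: park the word
            skid_data  <= in_data;
        end
    end
endmodule
```

Chaining one of these per stage gives each step exactly one cycle of local storage, which is enough as long as upstream honours in_ready on the following cycle.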

If you read through this entire mail:
thank you for your attention :)

Regards,
Kenneth


[1] http://langly.org/og/rastHori.png


-- 
Life on the earth might be expensive, but it 
includes an annual free trip around the sun.

Kenneth Østby
http://langly.org


_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
