On 2009-10-15, Timothy Normand Miller wrote:
> I've drawn this to illustrate the design of our shader engine:
>
> http://www.cse.ohio-state.edu/~millerti/shaderengine.pdf
>
> Several years ago, I designed an fp adder and an fp multiplier. IIRC,
> they required 6 pipeline stages. Andre did something similar, but I
> don't know what his results were. Also, Andre mentioned that there
> are sections of the pipelines for those units that are similar enough
> that they could be combined. Putting this together, we have a 9-stage
> pipeline, which would imply 10 time slots for 10 (or more) tasks
> assigned to the engine at once. It's 10 because 1/2 a BRAM stores the
> regfile for one task.
If we have one "free" thread, shouldn't we instead add a 1-slot queue in
front of the pipeline, so that the pipeline still runs at full capacity
while we have one pending load?

> Let's say that contexts 0 and 1 share a BRAM for regfile, and 2 is
> with 3, etc. The table below is an example of an order in which
> instructions from these contexts could be issued so that we never have
> a write-back happening to the same BRAM that's being read. Each row
> is a pipeline stage, where 0 is fetch and 1 is decode/regfile. The row
> labeled wb is the writeback, which is _also_ stage 1. The key is to
> ensure that the thread number in wb does not share a RAM block with
> the thread in decode. The first column is the row labels, and the
> remaining columns are the thread numbers of instructions in each of
> those stages in steady state.
>
> Notice also how the thread in wb on one cycle is the thread in fetch
> on the following cycle. When a branch instruction is fetched, it's
> passed to decode, where the condition is looked up and fed back to
> fetch. At that point, we have as many cycles as we want to calculate
> the next address (be it the branch target or PC+1). We can think
> about clever ways to store program counters in a dual-ported
> distributed RAM so that one cycle before we need a PC, the
> correct value is already in the table.
>
> stage
>  0   0 2 4 6 8 1 3 5 7 9 0
>  1   9 0 2 4 6 8 1 3 5 7 9
>  2   7 9 0 2 4 6 8 1 3 5 7
>  3   5 7 9 0 2 4 6 8 1 3 5
>  4   3 5 7 9 0 2 4 6 8 1 3
>  5   1 3 5 7 9 0 2 4 6 8 1
>  6   8 1 3 5 7 9 0 2 4 6 8
>  7   6 8 1 3 5 7 9 0 2 4 6
>  8   4 6 8 1 3 5 7 9 0 2 4
>  wb  2 4 6 8 1 3 5 7 9 0 2

We can do it this way if we need to. Then we must make sure that when a
thread is picked up after a load, it is only plugged into a compatible
slot. But aren't all the BRAMs dual-ported, and can't we then use one
port for each task?
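For what it's worth, the no-conflict property of that issue order is easy to check mechanically. Here's a quick sketch (my own, not from any engine sources) that models the steady-state round-robin order from the table and asserts that the thread in wb never shares a regfile BRAM with the thread in decode. The constants ISSUE_ORDER and WB_LAG are just read off the quoted table; the bram() pairing follows the "0 shares with 1, 2 with 3" convention above.

```python
# Model the quoted schedule: fetch issues threads in the order
# 0 2 4 6 8 1 3 5 7 9, repeating.  Stage s at cycle t holds the thread
# fetched at cycle t - s; writeback (wb) lands with stage-1 timing but
# trails fetch by the full 9-stage pipeline depth.
ISSUE_ORDER = [0, 2, 4, 6, 8, 1, 3, 5, 7, 9]
WB_LAG = 9  # wb at cycle t carries the thread fetched at t - 9

def bram(thread):
    """Contexts 0 and 1 share a BRAM, 2 shares with 3, etc."""
    return thread // 2

def check(cycles=100):
    n = len(ISSUE_ORDER)
    for t in range(cycles):
        decode = ISSUE_ORDER[(t - 1) % n]       # stage 1 (regfile read)
        wb = ISSUE_ORDER[(t - WB_LAG) % n]      # writeback (regfile write)
        fetch_next = ISSUE_ORDER[(t + 1) % n]   # fetch on the next cycle
        # The property the table is built around:
        assert bram(decode) != bram(wb), (t, decode, wb)
        # "the thread in wb on one cycle is the thread in fetch
        #  on the following cycle":
        assert wb == fetch_next, (t, wb, fetch_next)
    return True

check()
```

If the BRAMs really are used in true dual-port mode (one port for the read, one for the write), the constraint checked here would go away entirely, which is the question at the end.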
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
