On 2009-10-15, Timothy Normand Miller wrote:
> I've drawn this to illustrate the design of our shader engine:
> 
> http://www.cse.ohio-state.edu/~millerti/shaderengine.pdf
> 
> Several years ago, I designed an fp adder and an fp multiplier.  IIRC,
> they required 6 pipeline stages.  Andre did something similar, but I
> don't know what his results were.  Also, Andre mentioned that there
> are sections of the pipelines for those units that are similar enough
> that they could be combined.  Putting this together, we have a 9-stage
> pipeline, which would imply 10 time slots for 10 (or more) tasks
> assigned to the engine at once.  It's 10 because half a BRAM stores
> the regfile for one task, so five BRAMs hold ten register files.

If we have one "free" thread, shouldn't we instead add a 1-slot queue in
front of the pipeline, so that the pipeline still runs at full capacity
while we have one pending load?
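To sanity-check that intuition, here's a toy Python model of the idea: round-robin issue across the hardware contexts, with a 1-slot queue that parks the one thread waiting on a load.  All the numbers (which cycle the load happens, its latency) are illustrative, not from the design:

```python
from collections import deque

THREADS = 10   # hardware contexts, per the 10-slot figure above

def run(cycles=50, load_cycle=5, load_latency=7):
    """Toy model: round-robin issue with a 1-slot queue parking the
    single thread that is waiting on a load.  Returns the number of
    cycles on which nothing could be issued (pipeline bubbles)."""
    ready = deque(range(THREADS))
    pending = None            # the 1-slot queue: (thread, wakeup_cycle)
    bubbles = 0
    for cyc in range(cycles):
        if pending is not None and cyc >= pending[1]:
            ready.append(pending[0])      # load returned: requeue thread
            pending = None
        if not ready:
            bubbles += 1                  # nothing to issue this cycle
            continue
        t = ready.popleft()
        if cyc == load_cycle and pending is None:
            pending = (t, cyc + load_latency)   # park it on the load
        else:
            ready.append(t)               # rotate to the back as usual
    return bubbles

print(run())   # with 10 threads, parking one still leaves enough to issue every cycle
```

With ten contexts and only one parked at a time, the ready queue never empties, so the model reports zero bubbles; the open question is whether the BRAM-pairing constraint below survives requeueing a thread into an arbitrary slot.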

> Let's say that contexts 0 and 1 share a BRAM for regfile, and 2 is
> with 3, etc.  The table below is an example of an order in which
> instructions from these contexts could be issued so that we never have
> a write-back happening to the same BRAM that's being read.  Each row
> is a pipeline stage where 0 is fetch and 1 is decode/regfile. The row
> labeled wb is the writeback which is _also_ stage 1.  The key is to
> ensure that the thread number in wb does not share a RAM block with
> the thread in decode.  The first column is the row labels, and the
> remaining columns are the thread numbers of instructions in each of
> those stages in steady-state.
> 
> Notice also how the thread in wb on one cycle is the thread in fetch
> on the following cycle.  When a branch instruction is fetched, it's
> passed to decode where the condition is looked up and fed back to
> fetch.  At that point, we have as many cycles as we want to calculate
> the next address (be it the branch target or PC+1).  We can think
> about clever ways to store program counters in a dual-ported
> distributed RAM so that, one cycle before we need a PC, the
> correct value is already in the table.
> 
> stage
> 0     0  2  4  6  8  1  3  5  7  9  0
> 1     9  0  2  4  6  8  1  3  5  7  9
> 2     7  9  0  2  4  6  8  1  3  5  7
> 3     5  7  9  0  2  4  6  8  1  3  5
> 4     3  5  7  9  0  2  4  6  8  1  3
> 5     1  3  5  7  9  0  2  4  6  8  1
> 6     8  1  3  5  7  9  0  2  4  6  8
> 7     6  8  1  3  5  7  9  0  2  4  6
> 8     4  6  8  1  3  5  7  9  0  2  4
> wb    2  4  6  8  1  3  5  7  9  0  2
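For what it's worth, the schedule's invariant can be checked mechanically.  Assuming, as described above, that contexts 2k and 2k+1 share a BRAM and that writeback trails fetch by 9 stages (rows 0-8 plus the wb row), a quick Python sketch:

```python
# Fetch repeats this issue order; every stage sees the same sequence,
# delayed by its depth in the pipe.
ISSUE_ORDER = [0, 2, 4, 6, 8, 1, 3, 5, 7, 9]
DECODE, WB = 1, 9    # pipeline positions of decode and writeback

def thread_at(stage, cyc):
    """Thread occupying `stage` on cycle `cyc` in steady state."""
    return ISSUE_ORDER[(cyc - stage) % len(ISSUE_ORDER)]

for cyc in range(100):
    dec, wb = thread_at(DECODE, cyc), thread_at(WB, cyc)
    assert dec // 2 != wb // 2, (cyc, dec, wb)    # never the same BRAM
    # ...and the wb thread re-enters fetch on the following cycle:
    assert thread_at(0, cyc + 1) == wb
print("no regfile port conflicts")
```

The asserts never fire: decode and writeback always sit an even distance apart in the issue order, so they land on different BRAM pairs.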

We can do it this way if we need to.  Then we must make sure that when a
thread is picked up after a load, it is only plugged back into a
compatible slot.  But aren't all the BRAMs dual-ported, and couldn't we
then use one port for each task?
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
