On Fri, Oct 16, 2009 at 2:55 AM, Petter Urkedal <[email protected]> wrote:
> On 2009-10-15, Timothy Normand Miller wrote:
>> I've drawn this to illustrate the design of our shader engine:
>>
>> http://www.cse.ohio-state.edu/~millerti/shaderengine.pdf
>>
>> Several years ago, I designed an fp adder and an fp multiplier.  IIRC,
>> they required 6 pipeline stages.  Andre did something similar, but I
>> don't know what his results were.  Also, Andre mentioned that there
>> are sections of the pipelines for those units that are similar enough
>> that they could be combined.  Putting this together, we have a 9-stage
>> pipeline, which would imply 10 time slots for 10 (or more) tasks
>> assigned to the engine at once.  It's 10 because 1/2 a BRAM stores the
>> regfile for one task.
>
> If we have one "free" thread, shouldn't we instead add a 1-slot queue in
> front of the pipeline, so that the pipeline still runs at full capacity
> while we have one pending load?

Well, basically, we can try to just hop over the stalled thread and
select another one that is legal to issue at that time.  We will have
a scoreboard of which threads are in which pipeline stages (so, for
instance, we know which regfile to store to on writeback, or the
reader tag when issuing loads).  We can easily use this to predict
which thread is going to be in write-back at the time that any thread
we might issue reaches decode, since decode and write-back are the two
stages that contend for a regfile BRAM.

This may be another premature optimization that we'll want to keep on
the to-do list.
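
In rough Python (a sketch of the idea, not RTL; the names, the
2k/2k+1 regfile pairing, and the stage-8-feeds-wb timing are my
assumptions from the above), the selection might look like this:

# Scoreboard-based skip-over issue: pick a ready thread whose regfile
# BRAM won't collide with the writeback happening when it hits decode.
def bram_of(thread):
    return thread // 2            # threads 2k and 2k+1 share one BRAM

def pick_thread(ready, scoreboard):
    """scoreboard[s] is the thread in stage s now (None for a bubble).
    A thread fetched now reaches decode next cycle, exactly when the
    thread now in stage 8 reaches wb, so their BRAMs must differ."""
    wb_next = scoreboard[8]       # thread that is in wb next cycle
    for t in ready:
        if wb_next is None or bram_of(t) != bram_of(wb_next):
            return t
    return None                   # no legal thread; insert a bubble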

>
>> Let's say that contexts 0 and 1 share a BRAM for regfile, and 2 is
>> with 3, etc.  The table below is an example of an order in which
>> instructions from these contexts could be issued so that we never have
>> a write-back happening to the same BRAM that's being read.  Each row
>> is a pipeline stage where 0 is fetch and 1 is decode/regfile. The row
>> labeled wb is the writeback which is _also_ stage 1.  The key is to
>> ensure that the thread number in wb does not share a RAM block with
>> the thread in decode.  The first column is the row labels, and the
>> remaining columns are the thread numbers of instructions in each of
>> those stages in steady-state.
>>
>> Notice also how the thread in wb on one cycle is the thread in fetch
>> on the following cycle.  When a branch instruction is fetched, it's
>> passed to decode where the condition is looked up and fed back to
>> fetch.  At that point, we have as many cycles as we want to calculate
>> the next address (be it the branch target or PC+1).  We can think
>> about clever ways to store program counters in a dual-ported
>> distributed RAM so that one cycle before we need a PC, the
>> correct value is already in the table.
>>
>> stage
>> 0     0   2   4   6   8   1   3   5   7   9   0
>> 1     9   0   2   4   6   8   1   3   5   7   9
>> 2     7   9   0   2   4   6   8   1   3   5   7
>> 3     5   7   9   0   2   4   6   8   1   3   5
>> 4     3   5   7   9   0   2   4   6   8   1   3
>> 5     1   3   5   7   9   0   2   4   6   8   1
>> 6     8   1   3   5   7   9   0   2   4   6   8
>> 7     6   8   1   3   5   7   9   0   2   4   6
>> 8     4   6   8   1   3   5   7   9   0   2   4
>> wb    2   4   6   8   1   3   5   7   9   0   2
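
(As a sanity check on that rotation, here's a quick script of mine
that replays it and verifies both invariants: decode never shares a
BRAM with wb, and the thread in wb is fetched on the next cycle.)

# Replay the steady-state rotation from the table above.
order = [0, 2, 4, 6, 8, 1, 3, 5, 7, 9]   # repeating fetch order
DEPTH = 10                                # stages 0..8 plus wb

def bram_of(thread):
    return thread // 2                    # 2k and 2k+1 share a BRAM

pipe = [None] * DEPTH                     # pipe[0]=fetch ... pipe[9]=wb
for cycle in range(100):
    pipe = [order[cycle % len(order)]] + pipe[:-1]
    decode, wb = pipe[1], pipe[9]
    if decode is not None and wb is not None:
        assert bram_of(decode) != bram_of(wb), (cycle, decode, wb)
    if wb is not None:                    # wb now fetches next cycle
        assert wb == order[(cycle + 1) % len(order)]
print("rotation OK: no decode/wb BRAM conflicts; wb feeds next fetch")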
>
> We can do it this way if we need to.  Then we must make sure that when a
> thread is picked up after a load it is only plugged into a compatible
> slot.  But, aren't all the BRAMs dual-ported, and can't we then use one
> port for each task?

When a thread passes through decode, it needs to read two registers
at once, so both ports are already busy.
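
To put numbers on it (the two-operand read is as above; the single
result write in wb is my assumption):

# Port budget for one dual-ported regfile BRAM.
BRAM_PORTS = 2

def ports_needed(decoding, writing_back):
    reads = 2 if decoding else 0       # rs1 and rs2 read together
    writes = 1 if writing_back else 0  # one result written back
    return reads + writes

assert ports_needed(True, False) <= BRAM_PORTS  # decode alone fits
assert ports_needed(False, True) <= BRAM_PORTS  # wb alone fits
assert ports_needed(True, True) > BRAM_PORTS    # both at once: conflict

So dual porting doesn't buy us a free port per task; decode alone
saturates the block.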


-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project