The baseline architecture we have specified is a barrel processor.
Several threads are scheduled to the same pipeline, and each cycle, we
issue an instruction from a different thread.  When there are more
active (not stalled) threads than pipeline stages, there are NO
pipeline hazards, there is no need for branch prediction, etc.  This
is great for utilization, except for memory accesses, where
synchronized threads will all stall on memory at the same time.
Probably because
of this, mainstream GPUs do not implement barrel processors.
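To make the rotation concrete, here is a minimal sketch of barrel-style
issue logic (names like `Thread` and `issue_cycle` are illustrative, not
taken from the actual simulator):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Thread:
    tid: int
    stalled: bool = False  # e.g., waiting on an outstanding memory access

def issue_cycle(threads: List[Thread], last_issued: int) -> Optional[int]:
    """Pick the next non-stalled thread in strict round-robin order.
    Returns its tid, or None if every thread is stalled (a pipeline bubble)."""
    n = len(threads)
    for offset in range(1, n + 1):
        t = threads[(last_issued + offset) % n]
        if not t.stalled:
            return t.tid
    return None
```

When the number of non-stalled threads exceeds the pipeline depth, any
two instructions in flight belong to different threads, which is why the
hazards and the need for branch prediction disappear.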

Once again, our growing pool of papers on GPUs can be found here:
https://sourceforge.net/p/openshader/wiki/gpu_links_and_articles/

And several of them describe thread scheduling, but if you look at
Section 2 of 
http://hal-ens-lyon.archives-ouvertes.fr/docs/00/69/36/32/PDF/sbiswi.pdf,
you get a compact explanation.

Extrapolating from this, we can specify an alternative scheduling
algorithm, sketched roughly here:

- Instructions are issued from the SAME thread until there is a stall
condition.  Stall conditions include unresolved conditional
branches, memory accesses, and unresolved data dependencies.
- On a stall condition, we switch to the next thread that has a ready
instruction.
- Instruction results are tracked by dependence hardware, e.g. a
scoreboard (or, more aggressively, Tomasulo's algorithm)
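The switch-on-stall policy above can be sketched in the same style
(again hypothetical; `has_ready` stands in for whatever logic detects
the stall conditions listed above):

```python
from typing import Callable, Optional

def pick_thread(current: int, n_threads: int,
                has_ready: Callable[[int], bool]) -> Optional[int]:
    """Keep issuing from `current` while it has a ready instruction;
    on a stall, rotate to the next thread with a ready instruction.
    Returns None if every thread is stalled."""
    if has_ready(current):
        return current
    for offset in range(1, n_threads):
        cand = (current + offset) % n_threads
        if has_ready(cand):
            return cand
    return None
```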

This significantly complicates the pipeline, and we may want to make
some associated architectural changes:

- Result forwarding paths to shortcut dependencies
- A shorter integer pipeline, leading to out-of-order completion
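As an illustration of the first change, a simulator-level forwarding
check might look like this (a sketch under the assumption that we model
in-flight instructions as (destination register, result-ready) pairs,
youngest first):

```python
from typing import List, Tuple

def can_forward(src_reg: int, in_flight: List[Tuple[int, bool]]) -> bool:
    """True if the value of src_reg is obtainable without stalling:
    either its most recent in-flight producer already has a result we
    can forward, or no in-flight instruction writes that register."""
    for dest, ready in in_flight:   # youngest first
        if dest == src_reg:
            return ready            # forwardable only if already computed
    return True                     # value is safe in the register file
```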

The main benefit here is more spread-out memory accesses, which
matters because graphics workloads have poor temporal locality, and
their spatial locality is strictly sequential, so the only available
optimization is prefetching.  The risk with the barrel
architecture is that memory accesses may come in waves, where shaders,
en masse, will stall together waiting on memory.  This is very poor
for performance, because we don't get good utilization of either
memory or compute resources.  The challenge with a more sophisticated
scheduling algorithm is all the extra hardware we need to track
instruction dependencies.

We can compare these approaches in the simulator at increasing levels
of sophistication.  It would be nice to see if we can find ways to make the
barrel architecture perform well.  For instance, if the number of
threads is very large compared to the pipeline latency, then as memory
stalls occur, threads will temporarily drop out of the rotation.  But
before the number of active threads drops (too much) below the
pipeline length, memory data will have already started arriving.  Of
course, the whole point behind the simulator is to let us explore this
sort of thing.
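The back-of-the-envelope argument above can be checked with a toy model
(all parameters are made-up assumptions: each thread misses to memory
every `miss_interval` instructions, a miss takes `mem_latency` cycles,
and one instruction issues per cycle, round-robin among ready threads):

```python
def min_active(n_threads: int, miss_interval: int, mem_latency: int,
               cycles: int = 1000) -> int:
    """Minimum number of ready (in-rotation) threads observed over the run."""
    issued = [0] * n_threads   # instructions issued per thread
    wake = [0] * n_threads     # cycle at which each thread is ready again
    next_t = 0
    lowest = n_threads
    for cyc in range(cycles):
        lowest = min(lowest, sum(1 for t in range(n_threads) if wake[t] <= cyc))
        for off in range(n_threads):           # round-robin among ready threads
            t = (next_t + off) % n_threads
            if wake[t] <= cyc:
                issued[t] += 1
                if issued[t] % miss_interval == 0:
                    wake[t] = cyc + mem_latency  # drop out until data arrives
                next_t = t + 1
                break
    return lowest
```

Since at most one thread can miss per cycle in this model, no more than
`mem_latency` threads are ever asleep at once.  So with, say, a miss
every 10 instructions and a 50-cycle latency, 64 threads never drop
below the depth of an 8-stage pipeline, while 8 threads empty the
rotation entirely, which is exactly the wave behavior described above.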

-- 
Timothy Normand Miller, PhD
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
