The baseline architecture we have specified is a barrel processor. Several threads are scheduled onto the same pipeline, and each cycle, we issue an instruction from a different thread. When there are more active (not stalled) threads than pipeline stages, there are NO pipeline hazards, there is no need for branch prediction, etc. This is great for utilization, except for memory accesses, where threads running in lockstep will all stall on memory at the same time. Probably because of this, mainstream GPUs do not implement barrel processors.
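To make the round-robin issue policy concrete, here's a tiny simulator-style sketch in Python. This is purely illustrative (the thread and instruction representations are hypothetical, not from the actual design): at most one instruction issues per cycle, rotating over the threads and skipping any that are stalled.

```python
from collections import deque

def barrel_issue(threads, cycles):
    """Issue one instruction per cycle, round-robin over ready threads.

    threads: list of iterators yielding (op, extra_stall) pairs, where
    extra_stall is how many cycles beyond the usual one the thread stays
    busy (e.g. waiting on memory). Returns [(cycle, thread_id), ...].
    """
    ready_at = [0] * len(threads)          # cycle at which each thread may issue again
    rotation = deque(range(len(threads)))  # round-robin order
    trace = []
    for cycle in range(cycles):
        for _ in range(len(rotation)):
            tid = rotation[0]
            rotation.rotate(-1)            # this thread goes to the back of the line
            if ready_at[tid] > cycle:
                continue                   # thread is stalled; skip it this cycle
            op, extra_stall = next(threads[tid], (None, 0))
            if op is None:
                ready_at[tid] = cycles     # thread finished; park it
                continue
            trace.append((cycle, tid))
            ready_at[tid] = cycle + 1 + extra_stall
            break                          # at most one issue per cycle
    return trace
```

The lockstep-stall problem shows up directly in this model: if every thread hits a load with the same latency at the same point, they all drop out of the rotation together and the pipeline goes idle until memory responds.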
Once again, our growing pool of papers on GPUs can be found here:

https://sourceforge.net/p/openshader/wiki/gpu_links_and_articles/

Several of them describe thread scheduling, but Section 2 of http://hal-ens-lyon.archives-ouvertes.fr/docs/00/69/36/32/PDF/sbiswi.pdf gives a compact explanation. Extrapolating from this, we can specify an alternative scheduling algorithm, sketched roughly here:

- Instructions are issued from the SAME thread until there is a stall condition. Stall conditions include unresolved branches, memory accesses, and unresolved data dependencies.
- On a stall condition, we switch to the next thread that has a ready instruction.
- Instruction results are tracked using a scoreboard (or something more aggressive, like Tomasulo's algorithm).

This significantly complicates the pipeline, and we may want to make some associated architectural changes:

- Result forwarding paths to shortcut dependencies
- A shorter integer pipeline, leading to out-of-order commitment

The main advantage here is more spread-out memory accesses, which is a huge advantage considering that graphics workloads have poor temporal locality, and their spatial locality is entirely sequential, meaning we can only optimize using prefetch. The risk with the barrel architecture is that memory accesses may come in waves, where shaders, en masse, will stall together waiting on memory. This is very poor for performance, because we get good utilization of neither memory nor compute resources. The challenge with a more sophisticated scheduling algorithm is all the extra hardware we need to track instruction dependencies.

With increasing levels of sophistication, we can compare these in the simulator. It would be nice to see if we can find ways to make the barrel architecture perform well. For instance, if the number of threads is very large compared to the pipeline latency, then as memory stalls occur, threads will temporarily drop out of the rotation.
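For comparison in the simulator, the switch-on-stall policy can be sketched in the same style. Again, everything here is hypothetical and simplified: the per-thread scoreboard is collapsed into a single "ready-at" cycle, and forwarding paths are ignored.

```python
def switch_on_stall_issue(threads, cycles):
    """Keep issuing from the current thread until it stalls, then switch
    to the next thread with a ready instruction.

    threads: list of lists of (op, latency) pairs, latency in cycles.
    The 'scoreboard' is just the cycle at which each thread's pending
    result is available. Returns [(cycle, thread_id, op), ...].
    """
    n = len(threads)
    pc = [0] * n                 # next-instruction index per thread
    ready_at = [0] * n           # when each thread may issue again
    current = 0
    trace = []
    for cycle in range(cycles):
        # If the current thread is stalled or done, look for another one.
        for probe in range(n):
            tid = (current + probe) % n
            if pc[tid] < len(threads[tid]) and ready_at[tid] <= cycle:
                current = tid
                break
        else:
            continue             # nothing ready: a dead cycle
        op, latency = threads[current][pc[current]]
        pc[current] += 1
        trace.append((cycle, current, op))
        # Long-latency ops (loads, unresolved branches) stall this thread;
        # single-cycle ops let it keep issuing back to back.
        if latency > 1:
            ready_at[current] = cycle + latency
    return trace
```

Because each thread runs until it personally hits a load, the loads from different threads land at different cycles, which is exactly the "more spread-out memory accesses" effect described above.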
But before the number of active threads drops (too much) below the pipeline length, memory data will have already started arriving. Of course, the whole point behind the simulator is to let us explore this sort of thing.

--
Timothy Normand Miller, PhD
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
