It's been pointed out that memory accesses, especially when there's a
D-cache miss, can take hundreds of cycles.  This is one reason why a
lot of GPU designs won't switch threads vertically until they hit a
memory access.  Because of our deep pipeline, we want to cycle over
the threads round-robin, so every thread executes one instruction
every 16 cycles.  That should be long enough to cover the latency of
most D-cache hits.  For misses, the entire
horizontal warp will block, reducing the number of runnable warps
until the read is satisfied.  If the pipeline is, say, 9 stages, then
you can have 7 horizontal warps (24 threads) blocked before the
pipeline gets a bubble.  That accounts for 112 cycles (7 warps times
16 cycles each).  Since we'll have separate
read-request (which enqueues a read) and read-get (which pops the
read data and may block) instructions, every useful instruction the
compiler schedules between them absorbs another 16 cycles of latency.
If we can
make the compiler really damn smart, we can have request loops
separate from get loops, eliminating the problem entirely.

-- 
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
