It's been pointed out that memory accesses, especially on a D-cache miss, can take hundreds of cycles. This is one reason many GPU designs won't switch threads vertically until they hit a memory access. Because of our deep pipeline, we instead cycle over the threads round-robin, so every thread executes one instruction every 16 cycles. That should be long enough to cover most D-cache hits. On a miss, the entire horizontal warp blocks, reducing the number of runnable threads until the read is satisfied. If the pipeline is, say, 9 stages, then 7 horizontal warps (24 threads) can be blocked before the pipeline gets a bubble; at 16 cycles per round-robin slot, that accounts for 7 x 16 = 112 cycles. Since we'll have separate read-request (which enqueues a read) and read-get (which pops the read data and may block) instructions, every useful instruction scheduled between them absorbs another 16 cycles of latency. If we can make the compiler really damn smart, we can have request loops separate from get loops, eliminating the problem entirely.
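To make the arithmetic above concrete, here's a rough back-of-the-envelope model in Python. This is only a sketch of the reasoning in the paragraph; the function name and parameters are illustrative and not part of any actual OGP design:

```python
def absorbable_latency(warp_slots=16, pipeline_stages=9,
                       round_robin_period=16, independent_insns=0):
    """Rough model of how many cycles of memory latency this
    scheme can hide before the pipeline sees a bubble.

    warp_slots:         horizontal warps cycled round-robin
    pipeline_stages:    depth of the execution pipeline
    round_robin_period: cycles between successive instructions
                        of the same thread (16 in the text)
    independent_insns:  useful instructions scheduled between a
                        read-request and its matching read-get
    """
    # With a 9-stage pipeline, enough warps must stay runnable to
    # keep the pipeline full, so this many warps may block first:
    blockable_warps = warp_slots - pipeline_stages
    # Each blocked warp's slot comes around once per round-robin
    # period, and every independent instruction between the
    # request and the get pushes the blocking point back by
    # another full period for that thread.
    return (blockable_warps + independent_insns) * round_robin_period

print(absorbable_latency())                     # 7 blockable warps -> 112 cycles
print(absorbable_latency(independent_insns=3))  # 3 extra insns -> 160 cycles
```

The last parameter is what a smart compiler buys you: hoisting the read-request loop away from the read-get loop is just driving `independent_insns` high enough that the miss latency disappears entirely.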
--
Timothy Normand Miller
http://www.cse.ohio-state.edu/~millerti
Open Graphics Project

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
