> > R0-R7: global, shared, constant, writeable only by host CPU
> > R8-R15: global, shared, scratchpad, writes are broadcast to all others
> > R16-RXX: regular, thread-context registers
> >
> > The compelling advantage over a memory scratchpad is that even though you
> > can 'hide' latency, *its still there*, you've just hidden the problem.
>
> The only difference is energy, although lower energy is a solid argument.
> We want maximum throughput per unit area and maximum throughput per watt.
> (And incidentally, we often assume power and area are linearly related,
> for back-of-the-envelope calculations.)
>
> > There's no latency or pipeline hazards on the writes, and register latency
> > on the reads. It would be excessively convenient to do a really clean
> > 'barrier()' implementation by writing to the broadcast/scratchpad register
> > and knowing that you will not see the result of the write until it has been
> > broadcast and visible to every other compute element.
>
> I've investigated barriers before. See my Booster and VRSync papers.
> They're a pain all-around, and I'd rather we found ways to avoid them. I
> can see an argument for them in HPC workloads, but for graphics workloads,
> I think we should find another solution.
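
To make sure we are talking about the same thing, here is a rough sketch of
how I read the register split and the broadcast-write semantics quoted above.
It is a toy software model, not a hardware proposal: the Machine class, its
method names, the 32-register file size, and the single complete_broadcast()
step are all assumptions of mine for illustration, and "RXX" is left open in
the original.

```python
# Toy model of the quoted register split. All names are hypothetical.

CONSTANT = range(0, 8)     # R0-R7: shared, writable only by the host CPU
BROADCAST = range(8, 16)   # R8-R15: shared scratchpad, writes are broadcast
# R16 and up: ordinary per-thread registers


class Machine:
    def __init__(self, n_cores, n_regs=32):   # n_regs is an assumed size
        self.n_cores = n_cores
        self.shared = [0] * 16                # one logical copy of R0-R15
        self.private = [[0] * (n_regs - 16) for _ in range(n_cores)]
        self.pending = []                     # broadcast writes still in flight

    def host_write(self, reg, value):
        # Simplification: this model only lets the host touch R0-R7.
        if reg not in CONSTANT:
            raise ValueError("host_write handles only R0-R7 in this model")
        self.shared[reg] = value

    def write(self, core, reg, value):
        if reg in CONSTANT:
            raise PermissionError("R0-R7 are read-only for compute elements")
        if reg in BROADCAST:
            # Not visible to anyone, including the writer, until broadcast.
            self.pending.append((reg, value))
        else:
            self.private[core][reg - 16] = value

    def read(self, core, reg):
        if reg < 16:
            return self.shared[reg]
        return self.private[core][reg - 16]

    def complete_broadcast(self):
        # Assumption: every core observes the new values in the same step.
        for reg, value in self.pending:
            self.shared[reg] = value
        self.pending.clear()
```

A core that writes R8 and then spins on reading R8 only gets past the spin
once complete_broadcast() has run, at which point every other core sees the
value too. That is the property the quoted barrier() idea relies on. The
model covers visibility only; it says nothing about the power spike when
every core resumes at once.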

Personally, I think the solution is to include the voltage regulator on the
chip and tell it to turn on the juice a few cycles ahead of when all the
cores wake up. If the voltage regulator has lookahead into the
barrier/broadcast sync logic, it should be able to know that everyone is
going to wake up (or is likely to wake up) and boost the voltage ahead of,
or even simultaneously with, the power spike.

Given how often I see GPUs mentioned in the HPC context, designing only for
graphics workloads sounds like a bad idea.
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
