2013/3/18 Troy Benjegerdes <[email protected]>:
>> > R0-R7: global, shared, constant, writeable only by host CPU
>> > R8-R15: global, shared, scratchpad, writes are broadcast to all others
>> > R16-RXX: regular, thread-context registers
>> >
>> > The compelling advantage over a memory scratchpad is that even though you
>> > can 'hide' latency, *its still there*, you've just hidden the problem.
>> >
>>
>> The only difference is energy, although lower energy is a solid argument.
>> We want maximum throughput per unit area and maximum throughput per watt.
>> (And incidentally, we often assume power and area are linearly related,
>> for back-of-the-envelope calculations.)
>>
>> >
>> > There's no latency or pipeline hazards on the writes, and register latency
>> > on the reads. It would be excessively convenient to do a really clean
>> > 'barrier()' implementation by writing to the broadcast/scratchpad register
>> > and knowing that you will not see the result of the write until it has been
>> > broadcast and visible to every other compute element.
>> >
>>
>> I've investigated barriers before. See my Booster and VRSync papers.
>> They're a pain all-around, and I'd rather we found ways to avoid them. I
>> can see an argument for them in HPC workloads, but for graphics workloads,
>> I think we should find another solution.
>
> Personally, I think the solution is to include the voltage regulator on the
> chip and tell it to turn on the juice a few cycles ahead of when all the
> cores wake up.
>
> If the voltage regulator has lookahead into the barrier/broadcast sync logic
> you should be able to know everyone is going to wake up (or is likely to wake
> up), and boost the voltage ahead of, or even simultaneously to the power
> spike.
>
> Given how often I see GPUs mentioned in the HPC context, designing only for
> graphics workloads sounds like a bad idea.
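For reference, here is a minimal sketch of the register partition quoted above, just to pin down the read/write rules being discussed. It is a toy software model, not a proposed implementation: the class name `RegisterFile`, its methods, and the count of 16 private registers per thread are all assumptions (the thread leaves the upper bound as "RXX"), and it ignores timing entirely, so it says nothing about broadcast latency or barriers.

```python
class RegisterFile:
    """Toy model of the three register classes quoted above (timing ignored)."""

    CONST = range(0, 8)    # R0-R7: constant, writable only by the host CPU
    BCAST = range(8, 16)   # R8-R15: scratchpad, writes visible to every thread

    def __init__(self, num_threads, num_private=16):
        # num_private is an assumption; the thread only says "R16-RXX".
        self.shared = [0] * 16
        self.private = [[0] * num_private for _ in range(num_threads)]

    def host_write(self, reg, value):
        """Write a constant register from the host side."""
        if reg not in self.CONST:
            raise ValueError("host_write only targets R0-R7 in this sketch")
        self.shared[reg] = value

    def write(self, thread, reg, value):
        """Write from a compute thread."""
        if reg in self.CONST:
            raise PermissionError("R0-R7 are writable only by the host CPU")
        if reg in self.BCAST:
            # One shared copy stands in for the broadcast to all others.
            self.shared[reg] = value
        else:
            self.private[thread][reg - 16] = value

    def read(self, thread, reg):
        """Read from a compute thread."""
        if reg < 16:
            return self.shared[reg]
        return self.private[thread][reg - 16]


rf = RegisterFile(num_threads=2)
rf.host_write(0, 42)   # host sets a constant
rf.write(0, 8, 7)      # thread 0 writes a broadcast register
rf.write(0, 16, 5)     # thread 0 writes its own private register
print(rf.read(1, 0), rf.read(1, 8), rf.read(1, 16))  # 42 7 0
```

Thread 1 sees the host constant and thread 0's broadcast write, but not thread 0's private R16.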
A voltage regulator on a CPU is, sorry to say, simply a bad idea. The
process isn't the same: a regulator produces a lot of heat, and the most
efficient ones need large passive components that can't be integrated
effectively. Enabling lookahead to power up a regulator would also require
some serious inspection of the sync logic.

The workload and the regulator also react on totally different timescales.
For the regulator we are talking tens of milliseconds; for logic we are
talking nanosecond cycle times. Regulator modulation based on workload
would be totally out of sync.

What can currently be done is voltage and frequency variation based on
_overall_ processor utilization.
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
