> > This kind of functionality might be better abstracted into a more generic
> > collective operation register set. It would be worth some conversations
> > with the UPC and MPI guys about whether or not having collective
> > operations at the register level might help. (see
> > http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=5166978 )
>
> Let's consider a more general extension of this. It's VERY common to
> extend the logical register set to include a table of shared constants.
> GPU ISAs can avoid immediates this way. If we generalize this a bit, we
> can get a small register scratch-pad as well, just by making some of the
> shared ones writable. (Or all of them; security is not an issue at this
> level.) But putting aside that you have a terrible bottleneck here, the
> latency to access a memory-mapped scratchpad is functionally the same,
> because we hide any latency shorter than the pipeline by running dozens of
> threads on the same SM.

How about we take half the register set address space and use it for
shared constants and a global scratchpad? Something like this:

  R0-R7:   global, shared, constant, writable only by the host CPU
  R8-R15:  global, shared, scratchpad, writes are broadcast to all others
  R16-RXX: regular, thread-context registers

The compelling advantage over a memory scratchpad is that even though you
can 'hide' latency, *it's still there*; you've only hidden the problem.
With shared registers there are no latency or pipeline hazards on the
writes, and only register latency on the reads.

It would be extremely convenient to do a really clean 'barrier()'
implementation by writing to the broadcast/scratchpad register, knowing
that you will not see the result of the write until it has been broadcast
and is visible to every other compute element. The latency of the
broadcast operation should be the same as, or similar to, a write followed
by a read of a memory scratchpad.

You mention a bottleneck. Can you explain that more? If you have multiple
threads writing to R8 (shared scratchpad), you can either

  a) bitbucket all but the last write, or
  b) do something really interesting, like run it through the ALU for a
     summation or logical OR

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
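
To make the register map and the two write policies concrete, here is a rough behavioural model in Python. It is only a sketch of the idea, not a proposed implementation. The names (RegisterFile, host_write_const, end_cycle), the 32-bit register width, the choice of R31 as the top register (the map above leaves "RXX" open), and the decision to combine only writes that land in the same cycle are all assumptions made for illustration.

```python
from operator import add, or_

NUM_CONST = 8        # R0-R7: shared constants, writable only by the host
NUM_SHARED = 8       # R8-R15: shared scratchpad, writes are broadcast
NUM_PRIVATE = 16     # R16-R31: per-thread; upper bound assumed ("RXX")
SHARED_BASE = NUM_CONST
PRIVATE_BASE = NUM_CONST + NUM_SHARED
NUM_REGS = PRIVATE_BASE + NUM_PRIVATE
MASK = 0xFFFFFFFF    # assumed 32-bit registers


class RegisterFile:
    """Hypothetical model of the split register address space."""

    def __init__(self, num_threads, combine=None):
        self.const = [0] * NUM_CONST
        self.shared = [0] * NUM_SHARED
        self.private = [[0] * NUM_PRIVATE for _ in range(num_threads)]
        # combine=None  -> option (a): all but the last write are dropped
        # combine=add/or_ -> option (b): same-cycle writes go through the ALU
        self.combine = combine
        self.pending = {}  # shared index -> values written this cycle

    def _check(self, reg):
        if not 0 <= reg < NUM_REGS:
            raise IndexError(f"no such register R{reg}")

    def host_write_const(self, reg, value):
        if not 0 <= reg < NUM_CONST:
            raise IndexError("host may only write R0-R7 here")
        self.const[reg] = value & MASK

    def read(self, thread, reg):
        self._check(reg)
        if reg < SHARED_BASE:
            return self.const[reg]
        if reg < PRIVATE_BASE:
            return self.shared[reg - SHARED_BASE]
        return self.private[thread][reg - PRIVATE_BASE]

    def write(self, thread, reg, value):
        self._check(reg)
        if reg < SHARED_BASE:
            raise PermissionError("R0-R7 are writable only by the host CPU")
        if reg < PRIVATE_BASE:
            # Queued: not visible to anyone until the broadcast completes.
            self.pending.setdefault(reg - SHARED_BASE, []).append(value & MASK)
        else:
            self.private[thread][reg - PRIVATE_BASE] = value & MASK

    def end_cycle(self):
        """Resolve same-cycle writes and broadcast the result to all readers."""
        for idx, values in self.pending.items():
            result = values[0]
            for v in values[1:]:
                if self.combine is None:
                    result = v                          # last write wins
                else:
                    result = self.combine(result, v) & MASK
            self.shared[idx] = result
        self.pending.clear()


# Barrier-style use: every thread writes 1 to R8, the ALU sums the writes.
rf = RegisterFile(num_threads=4, combine=add)
for t in range(4):
    rf.write(t, 8, 1)
rf.end_cycle()
print(rf.read(0, 8))  # 4
```

In this model a barrier falls out of the summing policy: each thread writes 1 to R8 and proceeds once it reads back the thread count. Nobody can observe a partial result, because writes only become visible in end_cycle(), which stands in for the broadcast.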
