On Mon, Mar 18, 2013 at 10:59 AM, Troy Benjegerdes <[email protected]> wrote:
> > > This kind of functionality might be better abstracted into a more
> > > generic collective operation register set. It would be worth some
> > > conversations with the UPC and MPI guys about whether or not having
> > > collective operations at the register level might help. (See
> > > http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=5166978 )
> >
> > Let's consider a more general extension of this. It's VERY common to
> > extend the logical register set to include a table of shared constants.
> > GPU ISAs can avoid immediates this way. If we generalize this a bit, we
> > can get a small register scratch-pad as well, just by making some of the
> > shared ones writable. (Or all of them; security is not an issue at this
> > level.) But putting aside that you have a terrible bottleneck here, the
> > latency to access a memory-mapped scratchpad is functionally the same,
> > because we hide any latency shorter than the pipeline by running dozens
> > of threads on the same SM.
>
> How about we take half the register set address space and use it for
> shared constants and a global scratchpad, so something like this:
>
> R0-R7:   global, shared, constant, writeable only by host CPU
> R8-R15:  global, shared, scratchpad, writes are broadcast to all others
> R16-RXX: regular, thread-context registers
>
> The compelling advantage over a memory scratchpad is that even though you
> can 'hide' latency, *it's still there*; you've just hidden the problem.
> The only difference is energy, although lower energy is a solid argument.

We want maximum throughput per unit area and maximum throughput per watt.
(And incidentally, we often assume power and area are linearly related, for
back-of-the-envelope calculations.)

> There's no latency or pipeline hazards on the writes, and register latency
> on the reads.
> It would be excessively convenient to do a really clean 'barrier()'
> implementation by writing to the broadcast/scratchpad register and knowing
> that you will not see the result of the write until it has been broadcast
> and is visible to every other compute element.

I've investigated barriers before. See my Booster and VRSync papers. They're
a pain all around, and I'd rather we found ways to avoid them. I can see an
argument for them in HPC workloads, but for graphics workloads, I think we
should find another solution.

> The latency of the broadcast operation should be the same as, or similar
> to, a write followed by a read of a memory scratchpad.
>
> You mention a bottleneck... Can you explain that more? If you have
> multiple threads writing to R8 (the shared scratchpad), you can either
>
> a) bitbucket all but the last write
> b) do something really interesting, like run it through the ALU for a
>    summation or logical OR

The bottleneck applies (in a practical sense) to reads. There may be
competition for access to this shared resource. (Sure, writes can compete,
but let's assume those happen a lot less.)

--
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
