On Mon, Mar 18, 2013 at 10:59 AM, Troy Benjegerdes <[email protected]> wrote:

> > > This kind of functionality might be better abstracted into a more generic
> > > collective operation register set. It would be worth some conversations
> > > with the UPC and MPI guys about whether or not having collective
> > > operations at the register level might help. (see
> > > http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=5166978 )
> >
> > Let's consider a more general extension of this.  It's VERY common to
> > extend the logical register set to include a table of shared constants.
> > GPU ISAs can avoid immediates this way.  If we generalize this a bit, we
> > can get a small register scratch-pad as well, just by making some of the
> > shared ones writable.  (Or all of them; security is not an issue at this
> > level.)  But putting aside that you have a terrible bottleneck here, the
> > latency to access a memory-mapped scratchpad is functionally the same,
> > because we hide any latency shorter than the pipeline by running dozens
> > of threads on the same SM.
>
>
> How about we take half the register set address space, and use it for shared
> constants and a global scratchpad? Something like this:
>
> R0-R7: global, shared, constant, writeable only by host CPU
> R8-R15: global, shared, scratchpad, writes are broadcast to all others
> R16-RXX: regular, thread-context registers
>
> The compelling advantage over a memory scratchpad is that even though you
> can 'hide' latency, *it's still there*; you've just hidden the problem.
>

The only difference is energy, although lower energy is a solid argument.
We want maximum throughput per unit area and maximum throughput per watt.
(And incidentally, we often assume power and area are linearly related,
for back-of-the-envelope calculations.)
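Troy's proposed split above can be modeled behaviorally. The sketch below is a minimal Python illustration, not an implementation: the register ranges and access rules come from his example, while the class name, API, and per-thread register count are assumptions made purely for the sake of the example.

```python
class RegisterFile:
    """Behavioral sketch of the proposed partitioning:
    R0-R7  : global, shared constants, writable only by the host CPU
    R8-R15 : global, shared scratchpad, writes broadcast to all threads
    R16+   : regular, per-thread-context registers
    """

    def __init__(self, nthreads, nregs=32):
        self.const = [0] * 8                                # R0-R7, host-only
        self.scratch = [0] * 8                              # R8-R15, broadcast
        self.local = [[0] * (nregs - 16) for _ in range(nthreads)]

    def host_write_const(self, r, val):
        """Only the host CPU may write the shared constants."""
        assert 0 <= r <= 7, "only R0-R7 are host-writable constants"
        self.const[r] = val

    def write(self, tid, r, val):
        if r <= 7:
            raise PermissionError("R0-R7 writable only by host CPU")
        elif r <= 15:
            self.scratch[r - 8] = val   # single write, visible to all threads
        else:
            self.local[tid][r - 16] = val

    def read(self, tid, r):
        if r <= 7:
            return self.const[r]
        elif r <= 15:
            return self.scratch[r - 8]
        return self.local[tid][r - 16]


rf = RegisterFile(nthreads=2)
rf.host_write_const(3, 42)   # host sets a shared constant in R3
rf.write(0, 8, 7)            # thread 0 writes the broadcast scratchpad R8
rf.write(0, 16, 5)           # thread 0 writes its private R16
```

Note that after these writes, thread 1 sees the R8 value written by thread 0 (the broadcast), and thread 1's R16 remains untouched (per-thread context).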


>
> There are no latency or pipeline hazards on the writes, and only register
> latency on the reads. It would be exceedingly convenient to do a really
> clean 'barrier()' implementation by writing to the broadcast/scratchpad
> register and knowing that you will not see the result of the write until it
> has been broadcast and is visible to every other compute element.
>
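The barrier Troy describes can be sketched with ordinary threads standing in for compute elements. This is a behavioral model only: the write-completes-after-global-visibility property is emulated with a lock, and combining arrivals by summation (his option "b" below) is an assumption drawn from the description.

```python
import threading

class BroadcastReg:
    """Model of one broadcast scratchpad register (e.g. R8). A write does
    not 'complete' until the new value is visible to all elements; here a
    lock stands in for that hardware guarantee."""
    def __init__(self):
        self._val = 0
        self._lock = threading.Lock()

    def add(self, x):
        # combine concurrent writes by summation
        with self._lock:
            self._val += x

    def read(self):
        with self._lock:
            return self._val

def barrier(reg, nthreads):
    """Announce arrival with one broadcast write, then spin until every
    other element's arrival write is visible."""
    reg.add(1)
    while reg.read() < nthreads:
        pass

# four simulated compute elements meeting at the barrier
N = 4
reg = BroadcastReg()
arrived = []

def element(tid):
    barrier(reg, N)
    arrived.append(tid)   # reached only after all N elements have arrived

threads = [threading.Thread(target=element, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The spin loop is the usual cost of this scheme: every element polls the shared register until the last arrival, which is exactly the read-side contention discussed further down.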

I've investigated barriers before; see my Booster and VRSync papers.
They're a pain all-around, and I'd rather we found ways to avoid them.  I
can see an argument for them in HPC workloads, but for graphics workloads,
I think we should find another solution.


>
> The latency of the broadcast operation should be the same as or similar to
> a write followed by a read of a memory scratchpad.
>
>
> You mention a bottleneck. Can you explain that more? If you have multiple
> threads writing to R8 (shared scratchpad), you can either
>
> a) bitbucket all but the last write
> b) do something really interesting like run it through the ALU for a
>    summation or logical OR
>
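The two write-combining policies above can be stated in a few lines. The sketch below is illustrative only; the function name and the list-of-pending-writes framing are assumptions, not part of the proposal.

```python
from functools import reduce
import operator

def combine_writes(pending, policy):
    """Resolve multiple simultaneous writes to the shared register.
    policy="last"     -> option (a): bitbucket all but the last write
    policy=<ALU op>   -> option (b): fold the writes through the ALU
    """
    if policy == "last":
        return pending[-1]
    return reduce(policy, pending)

combine_writes([0b0011, 0b0101, 0b1000], operator.or_)   # logical OR -> 0b1111
combine_writes([1, 2, 3, 4], operator.add)               # summation  -> 10
combine_writes([1, 2, 3, 4], "last")                     # last write -> 4
```

Option (b) amounts to a register-level reduction, which is what makes the collective-operation framing in the cited paper relevant.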

The bottleneck applies (in a practical sense) to reads.  There may be
competition for access to this shared resource.  (Sure, writes can compete,
but let's assume those happen a lot less.)


-- 
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)