Re: [Open-graphics] Another architecture debate: Write-back flag vs. bit bucket

Troy Benjegerdes Mon, 18 Mar 2013 08:04:05 -0700

> > This kind of functionality might be better abstracted into a more generic
> > collective operation register set. It would be worth some conversations
> > with the UPC and MPI guys about whether or not having collective
> > operations at the register level might help. (see
> > http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=5166978 )
> >
> 
> Let's consider a more general extension of this.  It's VERY common to
> extend the logical register set to include a table of shared constants.
> GPU ISAs can avoid immediates this way.  If we generalize this a bit, we
> can get a small register scratch-pad as well, just by making some of the
> shared ones writable.  (Or all of them; security is not an issue at this
> level.)  But putting aside that you have a terrible bottleneck here, the
> latency to access a memory-mapped scratchpad is functionally the same,
> because we hide any latency shorter than the pipeline by running dozens of
> threads on the same SM.



How about we take half the register set address space, and use it for shared
constants and a global scratchpad, so something like this:

R0-R7: global, shared, constant, writeable only by host CPU
R8-R15: global, shared, scratchpad, writes are broadcast to all others
R16-RXX: regular, thread-context registers
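
To make the split concrete, here is a minimal sketch of how a register index
might be decoded under the layout above. The names (RegClass,
classify_register, thread_may_write) are hypothetical, and the total of 32
registers is only an assumption chosen so that R0-R15 is half the address
space; the upper bound "RXX" is not fixed in this proposal.

```python
from enum import Enum


class RegClass(Enum):
    SHARED_CONSTANT = "global, shared, constant (host-writable only)"
    SHARED_SCRATCHPAD = "global, shared, scratchpad (writes broadcast)"
    THREAD_LOCAL = "regular, thread-context register"


def classify_register(index: int, num_registers: int = 32) -> RegClass:
    """Map a logical register index to its class.

    num_registers=32 is an assumption for illustration, not part of the
    proposal; it makes R0-R15 exactly half of the register address space.
    """
    if not 0 <= index < num_registers:
        raise ValueError(f"register index {index} out of range")
    if index <= 7:
        return RegClass.SHARED_CONSTANT      # R0-R7
    if index <= 15:
        return RegClass.SHARED_SCRATCHPAD    # R8-R15
    return RegClass.THREAD_LOCAL             # R16 and up


def thread_may_write(index: int, num_registers: int = 32) -> bool:
    """Threads may write everything except the host-only constants."""
    return classify_register(index, num_registers) is not RegClass.SHARED_CONSTANT
```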

The compelling advantage over a memory scratchpad is that even though you
can 'hide' latency, *it's still there*; you've only hidden the problem.

There are no latency or pipeline hazards on the writes, and only register
latency on the reads. It would be extremely convenient to build a really clean
'barrier()' implementation by writing to the broadcast/scratchpad register,
knowing that you will not see the result of the write until it has been
broadcast and is visible to every other compute element.
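
A rough software model of that visibility rule, purely as a sketch: the
class and method names (BroadcastRegister, complete_broadcast,
barrier_released) are made up for illustration, and the model only captures
"nobody sees the write until everybody can", not arrival counting or any
real hardware timing.

```python
class BroadcastRegister:
    """One shared scratchpad register, replicated per compute element."""

    def __init__(self, num_elements: int):
        self.copies = [0] * num_elements  # each element's local copy
        self.pending = None               # write still in flight, if any

    def write(self, value: int) -> None:
        # The write is posted; no copy changes yet, not even the writer's.
        self.pending = value

    def read(self, element: int) -> int:
        # Reads hit the local copy, i.e. plain register latency.
        return self.copies[element]

    def complete_broadcast(self) -> None:
        # The broadcast lands on every copy in the same step.
        if self.pending is not None:
            self.copies = [self.pending] * len(self.copies)
            self.pending = None


def barrier_released(reg: BroadcastRegister, element: int, generation: int) -> bool:
    """An element passes the barrier once it observes the new generation."""
    return reg.read(element) == generation
```

With four elements, a write of generation 1 leaves every barrier_released()
check false, including the writer's own, until complete_broadcast() runs;
after that all four checks turn true together.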

The latency of the broadcast operation should be the same as or similar to
a write followed by a read of a memory scratchpad.


You mention a bottleneck. Can you explain that more? If you have multiple
threads writing to R8 (shared scratchpad), you can either:

a) bitbucket all but the last write
b) do something really interesting like run it through the ALU for a
   summation or logical OR
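
The two options can be sketched as a single combining step applied to all
writes that hit the same shared register in one cycle. This is only an
illustration: combine_writes, the policy names, and the 32-bit register
width are assumptions, not anything defined for the actual design.

```python
from functools import reduce
from operator import add, or_


def combine_writes(values, policy="last", width=32):
    """Resolve several same-cycle writes to one shared register.

    values is in issue order; width=32 is an assumed register width.
    """
    if not values:
        raise ValueError("no writes to combine")
    mask = (1 << width) - 1
    if policy == "last":
        result = values[-1]            # (a) bitbucket all but the last write
    elif policy == "sum":
        result = reduce(add, values)   # (b) summation through the ALU
    elif policy == "or":
        result = reduce(or_, values)   # (b) logical OR through the ALU
    else:
        raise ValueError(f"unknown policy: {policy}")
    return result & mask               # wrap to the register width
```

The "sum" policy is what would turn the register into a cheap arrival
counter or reduction, while "or" suits flag-style "did anyone set this"
checks.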
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
