> The bit bucket has been done a lot.  Not sure about the other one.  But
> some ISAs have NOT instructions to compensate.  We could always just do
> this:
> R1=SUBI(R0,1)
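For what it's worth, that trick can be sanity-checked in a few lines. A sketch in Python, assuming R0 is a hardwired-zero register, 32-bit registers, and that the all-ones value produced by SUBI is then used to synthesize NOT via XOR (all of which are my assumptions, not anything the ISA above specifies):

```python
# Synthesizing NOT on an ISA without it, using a hardwired-zero R0.
WIDTH = 32
MASK = (1 << WIDTH) - 1

def subi(rs, imm):
    """SUBI: rs - imm, two's complement, wrapped to 32 bits."""
    return (rs - imm) & MASK

def xor(a, b):
    return (a ^ b) & MASK

R0 = 0                 # hardwired zero
R1 = subi(R0, 1)       # 0 - 1 == 0xFFFFFFFF, the all-ones mask
x = 0x12345678
not_x = xor(x, R1)     # x XOR all-ones == NOT x
assert not_x == (~x) & MASK
```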

Okay, more than one bitbucket seems silly. But Yann must have had a reason
for mentioning more than one register.
 
> 
> >
> > Now, for one additional thought: What if we allow a special case for some
> > registers where we actually do the write-back, but turn off the
> > hazard/stall
> > logic?
> >
> 
> Not very useful with multithreading.  Having more threads than pipeline
> stages obviates this problem.

I'm not completely convinced of this. Suppose thread 1 is stalled on a
memory read, but thread 2 can dump the required result into Rs1 (it should
probably be called something slightly different than a regular register)
for thread 1 to pick up when it resumes.
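To make that idea concrete, here is a toy model of "write-back with the hazard/stall logic turned off". Everything here (the pending-write scoreboard, the Rs1 name, the write_unchecked path) is hypothetical, just illustrating the semantics:

```python
# Toy register file: ordinary writes are tracked by a scoreboard that
# stalls readers; the special shared register is written with no tracking.
class RegFile:
    def __init__(self):
        self.regs = {}
        self.pending = set()       # scoreboard: registers with in-flight writes

    def issue_write(self, name):
        self.pending.add(name)     # normal path: mark the hazard

    def writeback(self, name, value):
        self.regs[name] = value
        self.pending.discard(name)

    def write_unchecked(self, name, value):
        # special-case path: write-back with hazard/stall logic turned off
        self.regs[name] = value

    def read(self, name):
        if name in self.pending:
            return "STALL"         # a real pipeline would stall the reader
        return self.regs.get(name)

rf = RegFile()
rf.issue_write("R2")               # thread 1 stalled on a memory read into R2
rf.write_unchecked("Rs1", 42)      # thread 2 dumps its result into Rs1 meanwhile
assert rf.read("R2") == "STALL"
assert rf.read("Rs1") == 42
```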

> 
> >
> > I can imagine some parallel code algorithms that really care about the
> > flags,
> > and might have 16 threads that do some calculation, and you really only
> > need
> > the result from one of those threads, and you don't care which one, and if
> > you know the result will be in R1, but you don't care who put it there.
> >
> 
> The idea of this kind of cross-thread communication not through a scratch
> pad gives me the willies.  Especially if we get a warp divergence, where we
> don't know the relative time of execution.  Also, how do you make sure that
> you get only one thread to produce the result without a warp divergence, or
> how is that different from just having every thread produce the same
> result?  It would only make sense if that were precomputed by an entirely
> different earlier kernel, in which case we'd use scratch memory.
> 
> 
> >
> > Something in what I said above makes sense in my head, and I'm hoping
> > someone
> > else either gets an idea, or can describe something useful better than
> > what
> > I did above.
> >
> 
> Maybe I didn't get it.  :)

I didn't explain it very well. Imagine we're doing some sort of sparse
matrix operation (or graph search), where the first thread to get a hit lets
*all* threads move on to the next step. If we communicate through a register
rather than hitting address decoding and all that, we might save a few very
expensive (and stall-prone) pointer lookups into the graph/sparse matrix.
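A software analogue of that "first hit releases everyone" pattern, with a plain Python variable and a lock standing in for the proposed shared register (the point of the hardware version being that none of this would touch the memory or cache subsystem; all names here are illustrative):

```python
# First thread to find the target publishes its id; the rest stop early.
import threading

rs1 = None                        # the shared "register": None == no hit yet
rs1_lock = threading.Lock()

def search(tid, chunk, target):
    global rs1
    for node in chunk:
        if rs1 is not None:       # another thread already hit: move on
            return
        if node == target:
            with rs1_lock:
                if rs1 is None:
                    rs1 = tid     # first hit wins
            return

chunks = [range(0, 100), range(100, 200), range(200, 300)]
threads = [threading.Thread(target=search, args=(i, c, 250))
           for i, c in enumerate(chunks)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert rs1 == 2                   # only thread 2's chunk contains 250
```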

The piece that probably needs to be added for this crazy scheme to work is
an associated 'Rsc' (register shared context) that carries an N-bit tag
allocated by the operating system.

So the graph-search/sparse-matrix library does a syscall asking for a tag
for the shared register space, and if one is available it gets, say, 1/2/3,
so it knows it is the only running program using Rs1. If it doesn't get an
allocation, the fallback is to use the bit-bucket R0 register, or to have
the Rs(n) registers act just like the bit bucket.
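A toy model of that allocation dance; the interface, the tag width, and the fallback behaviour are all assumptions on my part:

```python
# OS-side tag allocator for the shared register space. A failed allocation
# signals the library to fall back to bit-bucket behaviour.
class RscAllocator:
    def __init__(self, bits=3):
        self.free = set(range(1, 1 << bits))   # tag 0 reserved

    def alloc(self):
        """Return an unused tag, or None if the space is exhausted."""
        return self.free.pop() if self.free else None

    def release(self, tag):
        self.free.add(tag)

allocator = RscAllocator()
tag = allocator.alloc()
if tag is not None:
    # Sole user of Rs1 within this tag's context.
    assert 1 <= tag < 8
else:
    # Fallback: treat the Rs(n) registers like the R0 bit bucket.
    pass
```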

The theory here is that this effectively gives us a scratchpad without
having to involve the memory or cache subsystem.

This kind of functionality might be better abstracted into a more generic
collective operation register set. It would be worth some conversations with
the UPC and MPI guys about whether having collective operations at the
register level might help (see
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=5166978
).

My intuition is that substantial power savings are possible in
applications that depend on a lot of synchronization. The state of the art
for clusters seemed to be having a thread (or every thread) poll a status
register on the network hardware as fast as it could; I don't think that's
the most power-efficient way to handle things.

If you can come up with something that could implement a hardware-assisted
AllReduce on a GPU, it would be quite an improvement on the state of the art.
(Have a look at info.ornl.gov/sites/publications/files/Pub24153.pdf for an
application that depends heavily on collective performance.)
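For reference, the communication pattern a hardware-assisted AllReduce would accelerate is the log2(N) butterfly exchange, after which every participant holds the full result. A minimal software sketch (assuming a power-of-two participant count):

```python
# Butterfly AllReduce: at step s, lane i combines with lane i XOR s.
# After log2(n) steps, every lane holds the sum of all inputs.
def allreduce_sum(values):
    n = len(values)            # assumed to be a power of two
    vals = list(values)
    step = 1
    while step < n:
        vals = [vals[i] + vals[i ^ step] for i in range(n)]
        step *= 2
    return vals

assert allreduce_sum([1, 2, 3, 4]) == [10, 10, 10, 10]
```

It's this pairwise-exchange structure that makes a register-level (or warp-level) implementation plausible: each step only needs a partner's single value, not a trip through memory.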


I think I might have just written a great PhD topic if I had the patience 
for academia ;) 
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
