On Sun, Mar 17, 2013 at 11:30 PM, Troy Benjegerdes <[email protected]> wrote:

> > The bit bucket has been done a lot.  Not sure about the other one.  But
> > some ISAs have NOT instructions to compensate.  We could always just do
> > this:
> > R1=SUBI(R0,1)
>
> Okay, more than one bitbucket seems silly. But Yann must have had a reason
> for mentioning more than one register.
>
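
An aside on the SUBI example quoted above, in case the identity isn't
obvious: with a hardwired-zero register, two's complement gives you the
all-ones pattern for free, so a dedicated all-ones register buys you
nothing.  A minimal C sketch, with r0/r1 standing in for the registers:

    #include <assert.h>
    #include <stdint.h>

    int main(void) {
        uint32_t r0 = 0;        /* R0: the bit bucket, reads as zero */
        uint32_t r1 = r0 - 1u;  /* R1 = SUBI(R0, 1) -> 0xFFFFFFFF */
        assert(r1 == ~r0);      /* same result as a NOT of R0 */
        return 0;
    }
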
> >
> > >
> > > Now, for one additional thought: What if we allow a special case for
> some
> > > registers where we actually do the write-back, but turn off the
> > > hazard/stall
> > > logic?
> > >
> >
> > Not very useful with multithreading.  Having more threads than pipeline
> > stages obviates this problem.
>
> I'm not completely convinced of this. Suppose thread 1 is stalled on a
> memory read, but thread 2 can dump the result required into Rs1 (it should
> probably be called something slightly different than a regular register),
>

Keep in mind that different threads don't share logical register files.
 How do you propose that we make one thread access another's RF?  And how
is that different from accessing a shared scratch memory?  And you still
have the problem of the relative timing of the threads not being
predictable under some circumstances.  What if the producer ends up running
later than the consumers?

It's true that memory stalls can be very lengthy.  But we can't count on
that.  GPUs have cache hierarchies.


>
> >
> > >
> > > I can imagine some parallel code algorithms that really care about the
> > > flags,
> > > and might have 16 threads that do some calculation, and you really only
> > > need
> > > the result from one of those threads, and you don't care which one,
> and if
> > > you know the result will be in R1, but you don't care who put it there.
> > >
> >
> > The idea of this kind of cross-thread communication not through a scratch
> > pad gives me the willies.  Especially if we get a warp divergence, where
> we
> > don't know the relative time of execution.  Also, how do you make sure
> that
> > you get only one thread to produce the result without a warp divergence,
> or
> > how is that different from just having every thread produce the same
> > result?  It would only make sense if that were precomputed by an entirely
> > different earlier kernel, in which case we'd use scratch memory.
> >
> >
> > >
> > > Something in what I said above makes sense in my head, and I'm hoping
> > > someone
> > else either gets an idea, or can describe something useful better than
> > > what
> > > I did above.
> > >
> >
> > Maybe I didn't get it.  :)
>
> I didn't explain it very well. Imagine we're doing some sort of sparse
> matrix (or graph search), and the first thread to get a hit lets us move on
> to the next step, for *all* threads, and if we communicate through a
> register
> rather than hitting address decoding and all that, we might save a few very
> expensive (and stall-prone) pointer lookups into a graph/sparse matrix.
>

There's buckets of research on this.  I assume you're at least somewhat
familiar with it.  Anyhow, the communication path you're talking about
doesn't exist.  The scratch memory is the mechanism that serves this
purpose.  We can make this have the same _effective_ latency as hitting the
RF.
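
To make that concrete, here's roughly what the "first hit wins" pattern
looks like when you route it through the scratch pad.  This is a
hypothetical CUDA-style kernel, just an analogy; the names are mine, not
anything we've specced:

    // Cross-thread "first hit wins" through on-chip shared memory
    // (the scratch pad), with no global address decoding involved.
    __global__ void first_hit(const int *keys, int n, int target, int *out)
    {
        __shared__ int found;             // the "shared register", one per block
        if (threadIdx.x == 0) found = -1;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && keys[i] == target)
            atomicCAS(&found, -1, i);     // first producer wins; the rest no-op
        __syncthreads();

        if (threadIdx.x == 0 && found != -1)
            *out = found;                 // any one thread publishes the result
    }
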


>
> The piece that probably needs to be added for this crazy scheme to work is
> some sort of associated 'Rsc' (register shared context) that has some sort
> of N-bit tag allocated by the operating system.
>

We could window the registers, kinda like SPARC, where some registers are
shared among SPs in the SM.  But with dynamic kernel scheduling, you can't
predict anything about which thread does what.

And then there's the fact that what you're describing counts on there being
a warp divergence, which makes things even more challenging.
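
For anyone following along, "warp divergence" means threads in the same
warp taking different directions at a branch, which the hardware has to
serialize.  A toy CUDA illustration (function names are hypothetical):

    __device__ int slow_path(int i)
    {
        int s = 0;
        for (int k = 0; k < 1000; ++k) s += k ^ i;  // long path
        return s;
    }

    __device__ int fast_path(int i) { return i; }   // short path

    // Threads in one warp that disagree on this branch execute the two
    // paths one after the other, so you can't count on their relative
    // timing from looking at the source.
    __global__ void divergent(const int *flags, int *out)
    {
        int i = threadIdx.x;
        out[i] = flags[i] ? slow_path(i) : fast_path(i);
    }
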


>
> So the graph-search/sparse matrix library does a syscall asking for a tag
> for
> the shared register space, and if it's available, it gets .. say 1/2/3, so
> it knows it is the only running program using RS1. If it doesn't get an
> allocation, fallback is to use the bitbucket R0 register, or the RS(n)
> registers act just like the bitbucket.
>

Syscall?  GPUs don't support syscalls.  :)  Hell, they don't support
exceptions of any kind.  You're thinking about this WAY too much like a
CPU.  Let me put it this way.  You think of a 16-core system as being able
to run many processes utilizing pre-emptive multitasking.  Nothing of the
sort exists.  A high-end nVidia GPU will support tens of thousands of
threads SIMULTANEOUSLY.  But "threads" is misleading, because each kernel
runs very briefly, so the scheduler's job is to figure out when kernels
complete and start more.  You really gotta read the papers I put up on the
OpenShader wiki.

The first five papers are introductory and are in the best sequence to read
them:
https://sourceforge.net/p/openshader/wiki/gpu_links_and_articles/

Anyhow, what you're describing would surely require a hardware mechanism
and a special instruction.
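
To put some code behind the model I'm describing: a kernel is tiny, and a
single launch puts tens of thousands of threads in flight, with no syscall
or preemption anywhere in the picture.  A standard CUDA sketch, not
anything OpenShader-specific:

    // A typical kernel: a handful of instructions per thread.
    __global__ void saxpy(float a, const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Host side: one launch = 256 blocks x 256 threads = 65,536 threads,
    // all scheduled by hardware:
    //   saxpy<<<256, 256>>>(2.0f, d_x, d_y, 65536);
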



>
> The theory here is this effectively gives us a scratchpad without having to
> involve the memory or cache subsystem.
>

I think we can do that anyhow.  Read the papers I pointed to; when you
realize that the median kernel size is maybe a dozen instructions, you may
rethink this.


>
> This kind of functionality might be better abstracted into a more generic
> collective operation register set. It would be worth some conversations
> with
> the UPC and MPI guys about whether or not having collective operations at
> the register level might help. ( see
>
> http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=5166978
> )
>

Let's consider a more general extension of this.  It's VERY common to
extend the logical register set to include a table of shared constants.
 GPU ISAs can avoid immediates this way.  If we generalize this a bit, we
can get a small register scratch-pad as well, just by making some of the
shared ones writable.  (Or all of them; security is not an issue at this
level.)  But putting aside that you have a terrible bottleneck here, the
latency to access a memory-mapped scratchpad is functionally the same,
because we hide any latency shorter than the pipeline by running dozens of
threads on the same SM.
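
In CUDA terms (again only as an analogy; our ISA details are open), the
read-only half of this is constant memory and the writable half is shared
memory:

    // Shared constant table: stands in for immediates, read like extra
    // registers.  The host fills it with cudaMemcpyToSymbol().
    __constant__ float c_tab[16];

    __global__ void eval(const float *x, float *y, int n)
    {
        __shared__ float scratch[32];   // the small writable scratch pad
        if (threadIdx.x < 32) scratch[threadIdx.x] = c_tab[threadIdx.x % 16];
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = scratch[0] + scratch[1] * x[i];  // no immediates needed
    }
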


>
> My intuition is that there could be some substantial power savings
> possible in
> applications that depend on a lot of synchronization. The state of the art
> for clusters seemed to be having a thread (or every thread) polling a
> status
> register on the network hardware as fast as it could.. I don't think that's
> the most power-effective way to handle things.
>

True.  There MIGHT be some energy savings.  And that's VERY important.  But
there are several fundamental problems that have to be solved first before
we can make a convincing argument that this (a) is a better solution, (b)
can be used effectively, (c) has many use cases, and (d) can be used by any
compiler.


>
> If you can come up with something that could implement a hardware assisted
> AllReduce on a GPU, this would be quite an improvement on state of the art.
> (Have a look at info.ornl.gov/sites/publications/files/Pub24153.pdf for an
> application that depends on collective performance a lot)
>

HPC isn't my area, so I don't know much about allreduce.  I just looked it
up, and one page says it's functionally equivalent to a reduce followed by
a broadcast.  I really have to think about how that would translate from a
supercomputer (e.g. a PGAS machine) to a GPU.
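
If reduce-plus-broadcast is the right way to read it, then within a single
block it maps naturally onto shared memory.  A one-block sketch (a
grid-wide allreduce would need global atomics or multiple launches):

    // Block-level "allreduce": tree-reduce in shared memory, then every
    // thread reads the total back, so the broadcast half is free.
    // Assumes blockDim.x is a power of two and <= 256.
    __global__ void block_allreduce(float *data, int n)
    {
        __shared__ float sum[256];
        int t = threadIdx.x;
        sum[t] = (t < n) ? data[t] : 0.0f;
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduce
            if (t < s) sum[t] += sum[t + s];
            __syncthreads();
        }

        if (t < n) data[t] = sum[0];    // all threads see the same total
    }
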


>
>
> I think I might have just written a great PhD topic if I had the patience
> for academia ;)
>

Believe me, it takes a LOT of patience.  And discipline and focus and OCD
and limited sanity, etc.  :)


-- 
Timothy Normand Miller, PhD
Assistant Professor of Computer Science, Binghamton University
http://www.cs.binghamton.edu/~millerti/
Open Graphics Project
_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
