Geoffrey Anderson <[email protected]> writes:
> So I've got this program using Elementwise and I want to up the
> performance one more level.  Nobody to my knowledge has written about
> using shared memory, but that does not mean it can't be done in an
> Elementwise program.  How can shared memory be used in an elementwise
> program without completely rewriting the thing as SourceModule?  That
> is, how to get an incremental improvement in my existing
> ElementwiseKernel program, with the least code change? 
>
> I suspect shared memory is the key.  I have lots of array work in my
> program, naturally.  To use shared memory, I imagine that the program
> would need to detect how many i's per block there are, because shared
> memory is block scoped (by i I mean the magic i that's passed in by
> the pycuda system to an ElementwiseKernel), and this value would be
> used as the size of the array of shared memory to be allocated.  I'm
> also not sure which thread should allocate the memory; probably only
> one thread per block should do this but I don't know how that could be
> achieved. Is the thread with index 0 for x the key here?  And
> how would an ElementwiseKernel reference that x value?

Three comments on this:

- I feel like shared memory isn't a good fit for the abstraction
  presented by ElementwiseKernel, which deliberately hides various
  details from you, including the thread block size being used.  Since
  you really need to know about thread blocks to make use of shared
  memory, including it would make the abstraction (more) leaky.
  Not a good thing.

- ElementwiseKernel really isn't magic. :) All it does is paste your
  code into this kernel template:

  https://github.com/inducer/pycuda/blob/master/pycuda/elementwise.py#L41 

  and then run the resulting kernel with a thread block size computed by
  pycuda.gpuarray.splay():

  https://github.com/inducer/pycuda/blob/master/pycuda/gpuarray.py#L109

- If your code fits into ElementwiseKernel, then I'm not sure you'll see
  much gain from using shared memory. Shared memory mainly helps avoid
  redundant fetches from global memory. Since ElementwiseKernels usually
  fetch each array entry exactly once, there likely isn't much in the
  way of savings to be had.
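
To make the "paste your code into a template" point concrete, here is a
rough sketch in plain Python (no GPU needed). The template text is a
simplified approximation written from memory, not the verbatim PyCUDA
source, and render_elementwise_source is a hypothetical helper name used
only for illustration. See the elementwise.py link above for the real
thing.

```python
# Simplified approximation of the CUDA template that ElementwiseKernel
# fills in. ASSUMPTION: this is NOT the verbatim PyCUDA source.
KERNEL_TEMPLATE = """
__global__ void %(name)s(%(arguments)s)
{
  unsigned tid = threadIdx.x;
  unsigned total_threads = gridDim.x*blockDim.x;
  unsigned cta_start = blockDim.x*blockIdx.x;
  unsigned i;

  for (i = cta_start + tid; i < n; i += total_threads)
  {
    %(operation)s;
  }
}
"""


def render_elementwise_source(name, arguments, operation):
    """Hypothetical helper: paste the user's snippet into the template."""
    return KERNEL_TEMPLATE % {
        "name": name,
        # The element count n is appended to the user's argument list.
        "arguments": arguments + ", unsigned n",
        "operation": operation,
    }


source = render_elementwise_source(
    name="axpb",
    arguments="float a, float *x, float b, float *z",
    operation="z[i] = a*x[i] + b",
)
print(source)
```

In this sketch the "magic" i is just a loop variable derived from
threadIdx.x and blockIdx.x, and the block size is chosen outside the
kernel, which is why the user's snippet cannot easily size a shared
memory array.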
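
The last point, about redundant fetches, can be illustrated with a small
counting sketch in plain Python. It compares global-memory reads for a
purely elementwise operation against a 3-point stencil, with and without
staging a tile in shared memory. The stencil, the periodic boundaries,
one output per thread, and the block size of 256 are all assumptions
picked for illustration, and the function names are hypothetical.

```python
def loads_elementwise(n):
    """z[i] = f(x[i]): each thread reads x[i] exactly once."""
    return n


def loads_stencil_naive(n):
    """z[i] = x[i-1] + x[i] + x[i+1], every read served by global memory.

    ASSUMPTION: periodic boundaries, so every thread does three reads.
    """
    return 3 * n


def loads_stencil_tiled(n, block_size):
    """Same stencil, but each block first copies its entries plus one
    halo entry per side into shared memory and then reads from there."""
    num_blocks = -(-n // block_size)  # ceiling division
    return n + 2 * num_blocks


n = 1_000_000
block_size = 256  # ASSUMPTION: arbitrary, typical-looking block size

print("elementwise:   ", loads_elementwise(n))
print("stencil, naive:", loads_stencil_naive(n))
print("stencil, tiled:", loads_stencil_tiled(n, block_size))
```

Under these assumptions, the elementwise case already sits at one read
per entry, so there is nothing left for shared memory to save, whereas
the stencil drops from about three reads per entry to just over one.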

HTH,
Andreas

_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
