Hello,

> Since ElementwiseKernels usually fetch each array entry exactly once,
> there likely isn't much in the way of savings to be had.

Are you sure?  In this application, each array entry is read at least
four times: it is a stencil computation for a 2D finite difference
method.  The code is up and running, and a CPU version was used to
verify the results.
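
To make the access pattern concrete, here is roughly the shape of the
kernel I'm using (a sketch only; unew, u, nx, and ny stand in for my
actual arrays and sizes):

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.elementwise import ElementwiseKernel

    # 5-point stencil update on a flattened ny-by-nx grid.  Every
    # interior entry of u is fetched by its four neighbors' threads,
    # i.e. at least four global memory reads per entry.
    stencil = ElementwiseKernel(
        "float *unew, const float *u, int nx, int ny",
        """
        int r = i / nx, c = i % nx;
        unew[i] = (r > 0 && r < ny - 1 && c > 0 && c < nx - 1)
            ? 0.25f * (u[i-1] + u[i+1] + u[i-nx] + u[i+nx])
            : u[i];
        """,
        "stencil_step")

    nx = ny = 512
    u = gpuarray.to_gpu(np.random.rand(ny, nx).astype(np.float32))
    unew = gpuarray.empty_like(u)
    stencil(unew, u, np.int32(nx), np.int32(ny))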

If the abstraction must fall by the wayside, then it will fall.  The
results come first.  I don't care about the beauty of Elementwise for
its own sake, although I can understand if you do (and I don't know
that you do).  Incremental change is the best way forward.  Scrapping
the program and starting over with SourceModule, or worse, pure CUDA
C, seems unnecessary just to get access to a little bit of shared
memory today.
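
For the record, here is roughly what I expect the shared-memory
rewrite would look like under SourceModule (an untested sketch; TILE,
stencil_step, and the array names are placeholders, and boundary
cells are left unhandled):

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    # Each block stages a (TILE+2) x (TILE+2) tile of u (its interior
    # plus a one-cell halo) in shared memory, so each entry is fetched
    # from global memory once per block rather than up to four times.
    mod = SourceModule("""
    #define TILE 16

    __global__ void stencil_step(float *unew, const float *u,
                                 int nx, int ny)
    {
        __shared__ float tile[TILE + 2][TILE + 2];

        int c = blockIdx.x * TILE + threadIdx.x - 1;
        int r = blockIdx.y * TILE + threadIdx.y - 1;

        /* launched with (TILE+2) x (TILE+2) threads: every thread,
           halo threads included, loads one entry */
        if (r >= 0 && r < ny && c >= 0 && c < nx)
            tile[threadIdx.y][threadIdx.x] = u[r * nx + c];
        __syncthreads();

        /* only the interior threads of the block write a result */
        if (threadIdx.x >= 1 && threadIdx.x <= TILE &&
            threadIdx.y >= 1 && threadIdx.y <= TILE &&
            r >= 1 && r < ny - 1 && c >= 1 && c < nx - 1)
            unew[r * nx + c] = 0.25f *
                (tile[threadIdx.y][threadIdx.x - 1] +
                 tile[threadIdx.y][threadIdx.x + 1] +
                 tile[threadIdx.y - 1][threadIdx.x] +
                 tile[threadIdx.y + 1][threadIdx.x]);
    }
    """)
    stencil_step = mod.get_function("stencil_step")

    nx = ny = 512
    u = gpuarray.to_gpu(np.random.rand(ny, nx).astype(np.float32))
    unew = gpuarray.empty_like(u)
    stencil_step(unew.gpudata, u.gpudata, np.int32(nx), np.int32(ny),
                 block=(18, 18, 1),                  # TILE + 2
                 grid=((nx + 15) // 16, (ny + 15) // 16))

That is the scale of rewrite I would rather avoid for now.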


What I probably failed to mention is that I'm not writing a library
module for someone else to enjoy reading and using.  Don't be afraid
of breaking a beautiful abstraction to make an application program
more efficient; whether the abstraction's reasons still hold at the
end of the day is for me to judge.


Thanks for any further insight on shared memory access inside an
ElementwiseKernel.

________________________________
 From: Andreas Kloeckner <[email protected]>
To: Geoffrey Anderson <[email protected]>; "[email protected]" 
<[email protected]> 
Sent: Sunday, April 21, 2013 6:19 PM
Subject: Re: [PyCUDA] shared memory as next step in performance with 
ElementwiseKernel
 

Geoffrey Anderson <[email protected]> writes:
> So I've got this program using Elementwise and I want to up the
> performance one more level.  Nobody to my knowledge has written about
> using shared memory, but that does not mean it can't be done in an
> Elementwise program.  How can shared memory be used in an elementwise
> program without completely rewriting the thing as SourceModule?  That
> is, how to get an incremental improvement in my existing
> ElementwiseKernel program, with the least code change? 
>
> I suspect shared memory is the key.  I have lots of array work in my
> program, naturally.  To use shared memory, I imagine that the program
> would need to detect how many i's per block there are, because shared
> memory is block scoped (by i I mean the magic i that's passed in by
> the pycuda system to an ElementwiseKernel), and this value would be
> used as the size of the array of shared memory to be allocated.  I'm
> also not sure which thread should allocate the memory; probably only
> one thread per block should do this, but I don't know how that could
> be achieved.  Is the thread with index 0 in x the key here?  And how
> would an ElementwiseKernel reference that x value?

Three comments on this:

- I feel like shared memory isn't a good fit for the abstraction
  presented by ElementwiseKernel, which deliberately hides various
  details from you, including the thread block size being used.  Since
  you really need to know about thread blocks to make use of shared
  memory, including it would make the abstraction (more) leaky.
  Not a good thing.

- ElementwiseKernel really isn't magic. :) All it does is paste your
  code into this here:

  https://github.com/inducer/pycuda/blob/master/pycuda/elementwise.py#L41 

  and then run the resulting kernel with a thread block size computed by
  pycuda.gpuarray.splay():

  https://github.com/inducer/pycuda/blob/master/pycuda/gpuarray.py#L109

  (See the sketch after these comments for roughly what the generated
  kernel looks like.)

- If your code fits into ElementwiseKernel, then I'm not sure you'll see
  much gain from using shared memory. Shared memory is good to help
  avoid redundant fetches. Since ElementwiseKernels usually fetch each
  array entry exactly once, there likely isn't much in the way of
  savings to be had.
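
Concretely, the pasted-together kernel comes out roughly like this (a
paraphrase of the template behind the first link above; the exact text
varies between versions):

    __global__ void kernel(/* your arguments */, unsigned n)
    {
        unsigned tid = threadIdx.x;
        unsigned total_threads = gridDim.x * blockDim.x;
        unsigned cta_start = blockDim.x * blockIdx.x;

        /* grid-stride loop: this is where the magic "i" comes from */
        for (unsigned i = cta_start + tid; i < n; i += total_threads)
        {
            /* your operation, pasted in verbatim */
        }
    }

splay() just picks a block and grid size for this loop; nothing about
the block shape is exposed to the pasted-in operation.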

HTH,
Andreas
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
