Hello,
> Since ElementwiseKernels usually fetch each array entry exactly once,
> there likely isn't much in the way of savings to be had.
Are you sure? In this set of applications, each array entry is read at
least four times: it is a stencil computation for a 2D finite difference
method. The code is up and running, and a CPU version was used to check
the results.
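For concreteness, here is a minimal sketch of the kind of 5-point
stencil I mean (illustrative only; the names, grid size, and boundary
handling are made up, not my actual code). Every interior entry of u is
fetched from global memory once for its own output and once by each of
its four neighbors, i.e. five times in total:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

# 5-point Laplacian on a 2D grid stored as a flat array of width w
laplace = ElementwiseKernel(
    "float *out, float *u, int w, int h",
    """
    int x = i % w, y = i / w;
    out[i] = (x > 0 && x < w-1 && y > 0 && y < h-1)
        ? u[i-1] + u[i+1] + u[i-w] + u[i+w] - 4.0f*u[i]
        : 0.0f;
    """,
    "laplace_5pt")

w = h = 512
u = gpuarray.to_gpu(np.random.rand(h, w).astype(np.float32))
out = gpuarray.empty_like(u)
laplace(out, u, np.int32(w), np.int32(h))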
If the abstraction must fall by the wayside, then it will fall; the
results come first. I don't care about the elegance of ElementwiseKernel
for its own sake, although I can understand if you do (and I don't know
that you do). Incremental change is the best way forward. Scrapping the
code and starting over with SourceModule, or worse, pure CUDA C, may
well be unnecessary just to get access to a little shared memory today.
What I probably failed to mention is that I'm not writing a library
module for someone else to enjoy reading and using. I'm not afraid of
breaking a beautiful abstraction to make an application program more
efficient, because there may be good reason for it; at the end of the
day, I will be the judge of that.
Thanks for any further insight on shared memory access inside an
ElementwiseKernel.
________________________________
From: Andreas Kloeckner <[email protected]>
To: Geoffrey Anderson <[email protected]>; "[email protected]"
<[email protected]>
Sent: Sunday, April 21, 2013 6:19 PM
Subject: Re: [PyCUDA] shared memory as next step in performance with
ElementwiseKernel
Geoffrey Anderson <[email protected]> writes:
> So I've got this program using Elementwise and I want to up the
> performance one more level. Nobody to my knowledge has written about
> using shared memory, but that does not mean it can't be done in an
> Elementwise program. How can shared memory be used in an elementwise
> program without completely rewriting the thing as SourceModule? That
> is, how to get an incremental improvement in my existing
> ElementwiseKernel program, with the least code change?
>
> I suspect shared memory is the key. I have lots of array work in my
> program, naturally. To use shared memory, I imagine that the program
> would need to detect how many i's per block there are, because shared
> memory is block scoped (by i I mean the magic i that's passed in by
> the pycuda system to an ElementwiseKernel), and this value would be
> used as the size of the array of shared memory to be allocated. I'm
> also not sure which thread should allocate the memory; probably only
> one thread per block should do this but I don't know how that could be
> achieved. Is the thread with x index 0 the key here? And
> how would an ElementwiseKernel reference that x value?
Three comments on this:
- I feel like shared memory isn't a good fit for the abstraction
presented by ElementwiseKernel, which deliberately hides various
details from you, including the thread block size being used. Since
you really need to know about thread blocks to make use of shared
memory, including it would make the abstraction (more) leaky.
Not a good thing.
- ElementwiseKernel really isn't magic. :) All it does is paste your
code into this here:
https://github.com/inducer/pycuda/blob/master/pycuda/elementwise.py#L41
and then run the resulting kernel with a thread block size computed by
pycuda.gpuarray.splay():
https://github.com/inducer/pycuda/blob/master/pycuda/gpuarray.py#L109
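  To make that concrete, an ElementwiseKernel ends up morally equivalent
  to a hand-written SourceModule kernel along these lines (a sketch, not
  the template's literal output; the kernel name and launch sizes here
  are made up):

  import numpy as np
  import pycuda.autoinit
  import pycuda.gpuarray as gpuarray
  from pycuda.compiler import SourceModule

  mod = SourceModule("""
  __global__ void axpb(float *z, const float *x, float a, float b, int n)
  {
      // grid-stride loop over i, as in the elementwise template
      for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < n;
           i += gridDim.x*blockDim.x)
          z[i] = a*x[i] + b;   // <-- your "operation" snippet goes here
  }
  """)
  axpb = mod.get_function("axpb")

  n = 1 << 20
  x = gpuarray.to_gpu(np.random.rand(n).astype(np.float32))
  z = gpuarray.empty_like(x)
  axpb(z.gpudata, x.gpudata, np.float32(2), np.float32(1), np.int32(n),
       block=(256, 1, 1), grid=(128, 1))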
- If your code fits into ElementwiseKernel, then I'm not sure you'll see
much gain from using shared memory. Shared memory is good to help
avoid redundant fetches. Since ElementwiseKernels usually fetch each
array entry exactly once, there likely isn't much in the way of
savings to be had.
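  If you do end up going the SourceModule route, the usual shared-memory
  pattern for a stencil is a tiled kernel roughly like the following.
  (This is an untested sketch; the tile size and names are illustrative,
  and it assumes w and h are multiples of the tile size.)

  import numpy as np
  import pycuda.autoinit
  import pycuda.gpuarray as gpuarray
  from pycuda.compiler import SourceModule

  mod = SourceModule("""
  #define TILE 16

  __global__ void laplace_tiled(float *out, const float *u, int w, int h)
  {
      __shared__ float tile[TILE+2][TILE+2];

      int x = blockIdx.x*TILE + threadIdx.x;
      int y = blockIdx.y*TILE + threadIdx.y;
      int tx = threadIdx.x + 1, ty = threadIdx.y + 1;

      // each thread loads its own entry once; edge threads also load halo
      tile[ty][tx] = u[y*w + x];
      if (threadIdx.x == 0)      tile[ty][0]      = (x > 0)   ? u[y*w + x-1]   : 0.f;
      if (threadIdx.x == TILE-1) tile[ty][TILE+1] = (x < w-1) ? u[y*w + x+1]   : 0.f;
      if (threadIdx.y == 0)      tile[0][tx]      = (y > 0)   ? u[(y-1)*w + x] : 0.f;
      if (threadIdx.y == TILE-1) tile[TILE+1][tx] = (y < h-1) ? u[(y+1)*w + x] : 0.f;
      __syncthreads();

      if (x > 0 && x < w-1 && y > 0 && y < h-1)
          out[y*w + x] = tile[ty][tx-1] + tile[ty][tx+1]
                       + tile[ty-1][tx] + tile[ty+1][tx] - 4.0f*tile[ty][tx];
  }
  """)
  laplace_tiled = mod.get_function("laplace_tiled")

  w = h = 512
  u = gpuarray.to_gpu(np.random.rand(h, w).astype(np.float32))
  out = gpuarray.zeros_like(u)
  laplace_tiled(out.gpudata, u.gpudata, np.int32(w), np.int32(h),
                block=(16, 16, 1), grid=(w // 16, h // 16, 1))

  That way each global entry is loaded roughly once per block that needs
  it instead of once per output that uses it, which is where any savings
  would come from.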
HTH,
Andreas
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda