Geoffrey Anderson <[email protected]> writes:
> So I've got this program using Elementwise and I want to up the
> performance one more level. Nobody to my knowledge has written about
> using shared memory, but that does not mean it can't be done in an
> Elementwise program. How can shared memory be used in an elementwise
> program without completely rewriting the thing as SourceModule? That
> is, how to get an incremental improvement in my existing
> ElementwiseKernel program, with the least code change?
>
> I suspect shared memory is the key. I have lots of array work in my
> program, naturally. To use shared memory, I imagine that the program
> would need to detect how many i's per block there are, because shared
> memory is block scoped (by i I mean the magic i that's passed in by
> the pycuda system to an ElementwiseKernel), and this value would be
> used as the size of the array of shared memory to be allocated. I'm
> also not sure which thread should allocate the memory; probably only
> one thread per block should do this but I don't know how that could be
> achieved. Is the thread having the index 0 for x be the key here? And
> how would an ElementwiseKernel reference that x value?

Three comments on this:

- I feel like shared memory isn't a good fit for the abstraction
  presented by ElementwiseKernel, which deliberately hides various
  details from you, including the thread block size being used. Since
  you really need to know about thread blocks to make use of shared
  memory, including it would make the abstraction (more) leaky. Not a
  good thing.

- ElementwiseKernel really isn't magic. :) All it does is paste your
  code into this here:

  https://github.com/inducer/pycuda/blob/master/pycuda/elementwise.py#L41

  and then run the resulting kernel with a thread block size computed by
  pycuda.gpuarray.splay():

  https://github.com/inducer/pycuda/blob/master/pycuda/gpuarray.py#L109

- If your code fits into ElementwiseKernel, then I'm not sure you'll see
  much gain from using shared memory. Shared memory is good to help
  avoid redundant fetches. Since ElementwiseKernels usually fetch each
  array entry exactly once, there likely isn't much in the way of
  savings to be had.

HTH,
Andreas

_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
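
To make the "fetched exactly once" point above concrete, here is a rough pure-Python model. It is not PyCUDA code and needs no GPU. It assumes the generated kernel walks the array with a grid-stride loop of the form `for (i = cta_start + tid; i < n; i += total_threads)`; that is an assumption about the template linked above, so check it against the source. The function names are made up for illustration, and the block and grid sizes are arbitrary values, not what splay() would pick. The model counts how often each array entry is read by an elementwise operation and, for contrast, by a 3-point stencil, which is the kind of access pattern where shared memory can pay off.

```python
from collections import Counter


def elementwise_reads(n, block_size, grid_size):
    """Count how often each index is read by an elementwise operation.

    Models the assumed grid-stride loop:
        for (i = cta_start + tid; i < n; i += total_threads)
    """
    reads = Counter()
    total_threads = grid_size * block_size
    for block_idx in range(grid_size):
        cta_start = block_size * block_idx
        for tid in range(block_size):
            for i in range(cta_start + tid, n, total_threads):
                reads[i] += 1  # the operation fetches x[i] once
    return reads


def stencil_reads(n, block_size, grid_size):
    """Same loop, but each i also reads its two neighbours (hypothetical
    3-point stencil), so neighbouring threads fetch overlapping entries."""
    reads = Counter()
    total_threads = grid_size * block_size
    for block_idx in range(grid_size):
        cta_start = block_size * block_idx
        for tid in range(block_size):
            for i in range(cta_start + tid, n, total_threads):
                for j in (i - 1, i, i + 1):
                    if 0 <= j < n:  # skip out-of-range neighbours
                        reads[j] += 1
    return reads


# Arbitrary illustrative sizes, not the values splay() would compute.
n, block_size, grid_size = 1000, 32, 4
ew = elementwise_reads(n, block_size, grid_size)
st = stencil_reads(n, block_size, grid_size)

print("elementwise: max reads per entry =", max(ew.values()))
print("stencil:     max reads per entry =", max(st.values()))
```

In this model every entry is read once in the elementwise case, so there is nothing for shared memory to cache. In the stencil case interior entries are read three times, and staging a block's entries in shared memory could remove most of those repeats; that kind of kernel is where moving to SourceModule would be worth considering.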
