I have a program that uses ElementwiseKernel, and I want to push its
performance one step further. To my knowledge nobody has written about using
shared memory in an ElementwiseKernel program, but that does not mean it can't
be done. How can shared memory be used in such a program without completely
rewriting it with SourceModule? That is, how can I get an incremental
improvement in my existing ElementwiseKernel program with the least code
change?

I suspect shared memory is the key, since my program naturally does a lot of
array work. To use shared memory, I imagine the kernel would need to know how
many i's each block handles, because shared memory is block-scoped. (By i I
mean the magic index that PyCUDA passes in to an ElementwiseKernel.) That count
would be the size of the shared memory array to allocate. I'm also not sure
which thread should allocate the memory; probably only one thread per block
should do it, but I don't know how that could be achieved. Is the thread whose
x index is 0 the key here? And how would an ElementwiseKernel refer to that x
value?
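
To make the "i's per block" question concrete, here is a minimal pure-Python
sketch of the indexing. It assumes the generated kernel walks i in a
grid-stride loop, starting at blockDim.x*blockIdx.x + threadIdx.x and stepping
by gridDim.x*blockDim.x. That loop form is an assumption about what
ElementwiseKernel generates, not something confirmed here, and the names
indices_for_thread and indices_for_block are hypothetical helpers invented for
illustration.

```python
# Pure-Python model of an ASSUMED grid-stride loop; no GPU or PyCUDA needed.
# block_size ~ blockDim.x, grid_size ~ gridDim.x, tid ~ threadIdx.x,
# block ~ blockIdx.x. Both function names are hypothetical.


def indices_for_thread(n, block_size, grid_size, block, tid):
    """Return the i values one thread would visit under the assumed loop."""
    total_threads = grid_size * block_size
    start = block_size * block + tid
    return list(range(start, n, total_threads))


def indices_for_block(n, block_size, grid_size, block):
    """Group the i values of one block by loop pass.

    Each inner list holds the i's the block touches in one pass. It never has
    more than block_size entries, so block_size would be the natural length
    for a block-scoped buffer.
    """
    per_thread = [
        indices_for_thread(n, block_size, grid_size, block, tid)
        for tid in range(block_size)
    ]
    passes = max((len(p) for p in per_thread), default=0)
    return [[p[k] for p in per_thread if k < len(p)] for k in range(passes)]


if __name__ == "__main__":
    # 10 elements, 2 blocks of 4 threads each
    for b in range(2):
        print(b, indices_for_block(10, 4, 2, b))
```

Under that assumption, a block handles at most blockDim.x consecutive i's per
pass, and a thread's position within the block is just threadIdx.x. In CUDA C
a `__shared__` array is declared in the kernel body and exists once per block,
so no single thread has to allocate it.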

Would any of the CUDA whiz kids like to propose how a program might detect the
number of i's in a block of an ElementwiseKernel, and show how to use shared
memory in it?

Regards,
ga
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda