I have a program built on ElementwiseKernel and I want to push its performance
one level further.  To my knowledge nobody has written about using shared
memory in an elementwise kernel, but that does not mean it can't be done.  How
can shared memory be used in such a program without completely rewriting it
with SourceModule?  That is, how can I get an incremental improvement in my
existing ElementwiseKernel program with the least code change?


I suspect shared memory is the key, since my program naturally does a lot of
array work.  To use shared memory, I imagine the program would need to detect
how many i's there are per block, because shared memory is block scoped (by i
I mean the magic index i that PyCUDA passes in to an ElementwiseKernel), and
this value would be used as the size of the shared memory array to be
allocated.  I'm also not sure which thread should allocate the memory;
probably only one thread per block should do this, but I don't know how that
could be achieved.  Is the thread with index 0 in x the key here?  And how
would an ElementwiseKernel reference that x value?
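
To make the question concrete, here is a small pure-Python model of how I
think the i's are distributed over blocks and threads.  It is only a sketch:
it assumes the generated kernel walks the array with a grid-stride loop of the
form "for (i = cta_start + tid; i < n; i += total_threads)", which I have not
confirmed, and the function names are made up for illustration.  Nothing here
touches the GPU.

```python
def elementwise_index_map(n, block_dim, grid_dim):
    """Model which i values each (block, thread) pair would visit.

    ASSUMPTION: the kernel uses a grid-stride loop,
        for (i = cta_start + tid; i < n; i += total_threads)
    """
    total_threads = grid_dim * block_dim
    visits = {}
    for block_idx in range(grid_dim):
        cta_start = block_dim * block_idx      # first i owned by this block
        for tid in range(block_dim):           # tid stands in for threadIdx.x
            visits[(block_idx, tid)] = list(
                range(cta_start + tid, n, total_threads)
            )
    return visits


def block_indices_per_pass(n, block_dim, grid_dim, block_idx):
    """List the i values one block touches on each pass of the loop."""
    total_threads = grid_dim * block_dim
    passes = []
    start = block_dim * block_idx
    while start < n:
        # One pass covers at most block_dim consecutive indices.
        passes.append(list(range(start, min(start + block_dim, n))))
        start += total_threads
    return passes


if __name__ == "__main__":
    # 10 elements, 2 blocks of 4 threads each.
    for block_idx in range(2):
        print(block_idx, block_indices_per_pass(10, 4, 2, block_idx))
```

If this model is right, a block touches at most block_dim consecutive i's on
each pass, so block_dim would be the natural size for a per-block shared
array, and tid would be the index into it.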


Would any of the CUDA whiz kids like to propose how a program might detect the
number of i's in a block of an ElementwiseKernel, and show how to use shared
memory in it?

 
Regards,


ga
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
