In this case, I believe the CUDA compiler is smart enough to realize that there is no need to repeatedly load a[i] from off-chip memory and store it back. Instead, it can keep a local copy of a[i] in a register and iterate on that. The main reason to use shared memory is to conserve global memory bandwidth, and since the compiler has already eliminated the extra memory traffic, shared memory can't speed things up here. Shared memory only becomes worthwhile with a more complicated access pattern that the compiler can't statically analyze away, for example when several threads in a block reuse the same values.
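To make the register argument concrete, here is a rough sketch of the kernel written the way I assume the compiler effectively treats the original one: a single load, a loop that works on a local variable, and a single store. I have not inspected the generated code, so take this as an illustration rather than a statement of what nvcc emits. The names `addone_reg`, `x`, `iterate_sin`, and `run_on_gpu` are made up for this example; the PyCUDA calls are the same ones used in the quoted script.

```python
import math

# Hand-written version of what the compiler is ASSUMED to do with the
# original kernel: load a[i] once, iterate in a register, store once.
KERNEL_SOURCE = """
__global__ void addone_reg(float *dest, float *a, int n_iter)
{
    const int i = blockDim.x*blockIdx.x + threadIdx.x;
    float x = a[i];                   // single load from global memory
    for (int n = 0; n < n_iter; n++) {
        x = sinf(x);                  // x is a local variable, no memory traffic
    }
    a[i] = x;                         // single store back to a, as in the original
    dest[i] = x;
}
"""


def iterate_sin(x, n_iter):
    """CPU reference for one element: apply sin n_iter times (double precision)."""
    for _ in range(n_iter):
        x = math.sin(x)
    return x


def run_on_gpu(blocks=64, block_size=512, n_iter=100000):
    """Compile and time the kernel. Needs NumPy, PyCUDA and a CUDA-capable GPU."""
    # Imported here so the module can be loaded on a machine without a GPU.
    import numpy
    import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    addone_reg = SourceModule(KERNEL_SOURCE).get_function("addone_reg")
    a = numpy.full(blocks * block_size, 2.0, dtype=numpy.float32)
    dest = numpy.zeros_like(a)

    start, end = drv.Event(), drv.Event()
    start.record()
    addone_reg(drv.Out(dest), drv.In(a), numpy.int32(n_iter),
               grid=(blocks, 1), block=(block_size, 1, 1))
    end.record()
    end.synchronize()
    return start.time_till(end) * 1e-3, dest
```

Calling `run_on_gpu()` returns the elapsed time in seconds and the result array, so it can be timed against the two versions quoted below. If the assumption holds, all three should take about the same time, since none of them touches global memory inside the loop.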
- bryan

On Wed, Feb 10, 2010 at 7:50 AM, Chris Heuser <[email protected]> wrote:
> Hello Everyone,
> I recently saw (on this list) the conversation about comparing GPU speed to
> CPU speed. The following code snippet was discussed:
>
> import pycuda.driver as drv
> import pycuda.tools
> import pycuda.autoinit
> import numpy
> import numpy.linalg as la
> from pycuda.compiler import SourceModule
>
> blocks = 64
> block_size = 512
> nbr_values = blocks * block_size
> n_iter = 100000
>
> #############
> # GPU SECTION
>
> mod = SourceModule("""
> __global__ void addone(float *dest, float *a, int n_iter)
> {
>   const int i = blockDim.x*blockIdx.x + threadIdx.x;
>   for(int n = 0; n < n_iter; n++) {
>     a[i] = sin(a[i]);
>   }
>   dest[i] = a[i];
> }
> """)
>
> addone = mod.get_function("addone")
> a = numpy.ones(nbr_values).astype(numpy.float32)
> a += 1 # a is now an array of 2s
>
> dest = numpy.zeros_like(a)
> start = drv.Event()
> end = drv.Event()
> start.record()
> addone(drv.Out(dest), drv.In(a), numpy.int32(n_iter), grid=(blocks,1),
>        block=(block_size,1,1))
> #stop timer
> end.record()
> end.synchronize()
> secs = start.time_till(end)*1e-3
> print "GPU time:", secs
> print "GPU result starts with...", dest[:3]
>
> It was then suggested that this could be made even faster with the use of
> shared memory. As I am currently trying to better understand shared memory,
> I tried to do just that:
>
> import pycuda.driver as drv
> import pycuda.tools
> import pycuda.autoinit
> import numpy
> import numpy.linalg as la
> from pycuda.compiler import SourceModule
>
> blocks = 64
> block_size = 512
> nbr_values = blocks * block_size
> n_iter = 100000
>
> #############
> # GPU SECTION
>
> mod = SourceModule("""
> __global__ void addone(float *dest, float *a, int n_iter)
> {
>   const int i = blockDim.x*blockIdx.x + threadIdx.x;
>
>   __shared__ float A[512];
>   A[threadIdx.x] = a[i];
>
>   syncthreads();
>
>   for(int n = 0; n < n_iter; n++) {
>     A[threadIdx.x] = sin(A[threadIdx.x]);
>   }
>
>   syncthreads();
>   dest[i] = A[threadIdx.x];
> }
> """)
>
> addone = mod.get_function("addone")
> a = numpy.ones(nbr_values).astype(numpy.float32)
> a += 1 # a is now an array of 2s
>
> dest = numpy.zeros_like(a)
> start = drv.Event()
> end = drv.Event()
> start.record()
> addone(drv.Out(dest), drv.In(a), numpy.int32(n_iter), grid=(blocks,1),
>        block=(block_size,1,1))
> #stop timer
> end.record()
> end.synchronize()
> secs = start.time_till(end)*1e-3
> print "GPU time:", secs
> print "GPU result starts with...", dest[:3]
>
> It seems simple enough, but for some reason the shared memory version is
> slower. Not by much, but it is definitely slower.
> Am I doing something wrong, or is the time to transfer to shared memory just
> greater than the time saved by using shared memory?
>
> Thank you for your time!
> ---Chris Heuser
> _______________________________________________
> PyCUDA mailing list
> [email protected]
> http://host304.hostmonster.com/mailman/listinfo/pycuda_tiker.net
