Re: [PyCUDA] Shared Memory Question

Bryan Catanzaro Wed, 10 Feb 2010 09:42:10 -0800

In this case, I believe the CUDA compiler is smart enough to realize
that there is no need to continually store and load a[i] from off-chip
memory, instead it can just keep a local copy of a[i] in registers and
iterate on it.  The main reason to use shared memory is to conserve
memory bandwidth, and since the CUDA compiler already eliminated the
extra memory traffic, shared memory can't actually speed things up.  A
more complicated access pattern that the compiler can't statically
analyze away would make shared memory necessary.


- bryan

On Wed, Feb 10, 2010 at 7:50 AM, Chris Heuser
<[email protected]> wrote:
> Hello Everyone,
> I recently saw (on this list) the conversation about comparing GPU speed to
> CPU speed. The following code snippet was discussed:
>
>
> import pycuda.driver as drv
> import pycuda.tools
> import pycuda.autoinit
> import numpy
> import numpy.linalg as la
> from pycuda.compiler import SourceModule
> blocks = 64
> block_size = 512
> nbr_values = blocks * block_size
> n_iter = 100000
> #############
>
> # GPU SECTION
>
> mod = SourceModule("""
>
> __global__ void addone(float *dest, float *a, int n_iter)
>
> {
>
>  const int i = blockDim.x*blockIdx.x + threadIdx.x;
>
>  for(int n = 0; n < n_iter; n++) {
>
>    a[i] = sin(a[i]);
>
>  }
>
>  dest[i] = a[i];
>
> }
>
> """)
> addone = mod.get_function("addone")
> a = numpy.ones(nbr_values).astype(numpy.float32)
> a += 1 # a is now an array of 2s
>
> dest = numpy.zeros_like(a)
> start = drv.Event()
> end = drv.Event()
> start.record()
> addone(drv.Out(dest), drv.In(a), numpy.int32(n_iter), grid=(blocks,1),
> block=(block_size,1,1))
> #stop timer
>
> end.record()
> end.synchronize()
> secs = start.time_till(end)*1e-3
> print "GPU time:", secs
> print "GPU result starts with...", dest[:3]
>
>
> It was then suggested that this could be made even faster with the use of
> shared memory. As I am currently trying to better understand shared memory,
> I tried to just that:
>
>
>
> import pycuda.driver as drv
> import pycuda.tools
> import pycuda.autoinit
> import numpy
> import numpy.linalg as la
> from pycuda.compiler import SourceModule
> blocks = 64
> block_size = 512
> nbr_values = blocks * block_size
> n_iter = 100000
> #############
>
> # GPU SECTION
>
> mod = SourceModule("""
>
> __global__ void addone(float *dest, float *a, int n_iter)
>
> {
>
>
>
>  const int i = blockDim.x*blockIdx.x + threadIdx.x;
>
>
>
>  __shared__ float A[512];
>
>  A[threadIdx.x] = a[i];
>
>
>
>  syncthreads();
>
>
>
>  for(int n = 0; n < n_iter; n++) {
>
>
>
>    A[threadIdx.x] = sin(A[threadIdx.x]);
>
>
>
>  }
>
>
>
>  syncthreads();
>
>  dest[i] = A[threadIdx.x];
>
> }
>
> """)
> addone = mod.get_function("addone")
> a = numpy.ones(nbr_values).astype(numpy.float32)
> a += 1 # a is now an array of 2s
>
> dest = numpy.zeros_like(a)
> start = drv.Event()
> end = drv.Event()
> start.record()
> addone(drv.Out(dest), drv.In(a), numpy.int32(n_iter), grid=(blocks,1),
> block=(block_size,1,1))
> #stop timer
>
> end.record()
> end.synchronize()
> secs = start.time_till(end)*1e-3
> print "GPU time:", secs
> print "GPU result starts with...", dest[:3]
>
>
>
> It seems simple enough, but for some reason the shared memory version is
> slower. Not by much, but it is definitely slower.
> Am I doing something wrong, or is the time to transfer to shared memory just
> greater than the time saved by using shared memory?
>
> Thank you for your time!
>     ---Chris Heuser
> _______________________________________________
> PyCUDA mailing list
> [email protected]
> http://host304.hostmonster.com/mailman/listinfo/pycuda_tiker.net
>
>

_______________________________________________
PyCUDA mailing list
[email protected]
http://host304.hostmonster.com/mailman/listinfo/pycuda_tiker.net

Re: [PyCUDA] Shared Memory Question

Reply via email to