Hi

You may want to hold out for a more authoritative response from someone
else, but I have noticed and write my code assuming that

- func() will launch the kernel and return (almost) immediately
- attempts to access gpuarrays involved in a launched kernel will block
until the launched kernel has completed
- pycuda.driver.Context.synchronize() can be called to explicitly wait for
a launched kernel to complete (which is useful if you have two kernels
operating on the same data, as they could otherwise run simultaneously)
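
For example, the asynchronous behavior can be seen by timing the launch
and the synchronize separately. This is only a minimal sketch (the kernel
name, array size, and block size are mine, not from your code), and it
assumes a CUDA-capable device with pycuda installed:

```python
import time
import numpy
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

# Trivial illustrative kernel: double each element in place.
mod = SourceModule("""
__global__ void double_it(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        a[idx] *= 2.0f;
}
""")
double_it = mod.get_function("double_it")

n = 1 << 22
a = gpuarray.to_gpu(numpy.ones(n, dtype=numpy.float32))

t0 = time.time()
double_it(a, numpy.int32(n),
          grid=((n + 511) // 512, 1), block=(512, 1, 1))
t1 = time.time()  # launch has returned; kernel may still be running

cuda.Context.synchronize()  # block until the kernel actually finishes
t2 = time.time()

print("launch call took %.6f s, kernel done after %.6f s" % (t1 - t0, t2 - t0))
```

On my setup the first time is typically much smaller than the second,
which is what shows that the call itself does not wait for the kernel.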

cheers
Marmaduke

On Fri, Jul 6, 2012 at 11:39 AM, Orestis K <[email protected]> wrote:

>  Hello everyone!
>
> I'm new to PyCUDA and GPU programming; however, my initial experiences
> have been very pleasant. I started out with some simple tasks, and it
> seems blazingly fast compared to running on a CPU. However, I would like
> to confirm that it is indeed as fast as it seems.
>
> My main question is whether, after 'func' is called and control of the
> prompt is regained, any tasks are still running on the GPU. If so, is
> there a way to block the next tasks from starting until the kernel has
> finished?
>
> I've posted the code below for reference purposes. You can lower the
> value of N to make it run faster. I set it very close to the limit so
> that I might witness a delay before control of the command prompt is
> returned.
>
> Thank you in advance and please keep up the excellent work!
> -Orestis
>
> =================================================================
> import pycuda.driver as cuda
> import pycuda.autoinit
> from pycuda.compiler import SourceModule
> import pycuda.gpuarray as gpuarray
>
> import sys, numpy, random, string
>
> # create random input data
> N = 33500000
> buf = ''.join(random.choice(string.ascii_uppercase +
> string.ascii_lowercase + string.digits) for x in xrange(N))
>
>
> mod = SourceModule("""
>   __global__ void get_words(int N, char *a,unsigned int *b)
>   {
>     int idx = blockIdx.x * blockDim.x + threadIdx.x;
>     if ( idx <N-3)
>         {
>         b[idx] = (a[idx] << 24) +  (a[idx+3]);
>         }
>   }
>   """)
> func = mod.get_function("get_words")
>
> # copy buffer to GPU
> bufArray = cuda.mem_alloc(N)
> cuda.memcpy_htod(bufArray, buf)
>
> # create results array on GPU
> resArray = gpuarray.to_gpu(numpy.zeros((N-3,1),dtype=numpy.int32))
>
> # setup parameters and execute function
> threadsPerBlock = 512
> blocksPerGrid = (N+threadsPerBlock-1)/threadsPerBlock
> func(numpy.int32(len(buf)), bufArray, resArray, grid=(blocksPerGrid ,1),
> block=(threadsPerBlock,1,1))
>
> # get back results
> a = numpy.zeros((N-3,1),dtype=numpy.int32)
> b = resArray.get(a)
> a = a.reshape(-1).tolist()
>
> _______________________________________________
> PyCUDA mailing list
> [email protected]
> http://lists.tiker.net/listinfo/pycuda
>
>
