Suppose I have a column-major array stored in linear memory on the GPU, and want to run a kernel on one column.

One way would be to pass the base pointer and the column offset as kernel parameters and have every thread perform the same addition. That seems unnecessarily inefficient.

Another way would be to allocate each column separately and keep vectors of pointers to the columns around for kernels that need to process the whole array. That seems like a mess.

When calling a kernel from C, the handles to device arrays are just addresses in device memory, so you can apply the offset on the host before launching the kernel; that seems like the right way to go about it in C.

What's the right way to do this in a pycuda context?

I found something promising in one of the tests:

        # now try with offsets
        dest = numpy.zeros_like(a)
        multiply_them(
                drv.Out(dest), numpy.intp(a_gpu)+1, b_gpu,
                block=(399,1,1))

and I found the same syntax in:

http://documen.tician.de/pycuda/tutorial.html?highlight=intp#structures

but this doesn't seem to be explicitly documented.
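For what it's worth, here is a minimal sketch of the host-side pointer arithmetic that syntax implies. The base address below is a placeholder (in real code you'd get it from `int(a_gpu)`, since a PyCUDA `DeviceAllocation` converts to its integer device address), and the array shape, dtype, and kernel name are assumptions for illustration:

```python
import numpy as np

# Placeholder for the integer device address of the array's base;
# in PyCUDA this would be int(a_gpu) for a DeviceAllocation a_gpu.
base = 0x7F0000000000

n_rows = 512                 # assumed number of rows
dtype = np.dtype(np.float32) # assumed element type
j = 3                        # column to process

# Column-major layout: column j starts j * n_rows * itemsize bytes
# past the base. Note that the offset is in BYTES, not elements.
offset_bytes = j * n_rows * dtype.itemsize
col_ptr = np.intp(base + offset_bytes)

print(hex(int(col_ptr)))  # → 0x7f0000001800
```

The resulting `np.intp` can then be passed to the kernel in place of the allocation itself, e.g. `my_kernel(col_ptr, block=(n_rows, 1, 1))`, mirroring the `numpy.intp(a_gpu)+1` pattern from the test above.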

Thanks!
Drew


_______________________________________________
PyCUDA mailing list
[email protected]
http://tiker.net/mailman/listinfo/pycuda_tiker.net
