Suppose I have a column-major array stored in linear memory on the
GPU, and I want to run a kernel on a single column.
One way would be to pass the base pointer and the offset into the
kernel as parameters and repeat the same address arithmetic in every
thread of my kernel. This seems unnecessarily inefficient.
Another way would be to allocate each column separately and keep
around vectors of pointers to columns for kernels that need to process
the whole array. This seems like a mess.
When calling a kernel from C, the handles to device arrays are just
addresses into device memory, so you can apply the offset before
launching the kernel; that seems like the right way to go about it
in C.
What's the right way to do this in a PyCUDA context?
I found something promising in one of the tests:
# now try with offsets
dest = numpy.zeros_like(a)
multiply_them(
    drv.Out(dest), numpy.intp(a_gpu)+1, b_gpu,
    block=(399,1,1))
and I found the same syntax in:
http://documen.tician.de/pycuda/tutorial.html?highlight=intp#structures
but this doesn't seem to be explicitly documented.
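For what it's worth, numpy.intp is just a pointer-sized integer, so the snippet above appears to work by plain address arithmetic: coercing the allocation to numpy.intp yields the device address, and adding a byte offset gives a new address that can be passed to the kernel in place of the allocation itself. A host-only sketch of that arithmetic (the base address and geometry are invented for illustration; note the offset is in bytes, not elements):

```python
import numpy as np

# Hypothetical device base address -- illustration only; for a real
# allocation this would be numpy.intp(a_gpu).
base = np.intp(0x10000)
rows = 399                                # rows of a column-major array
itemsize = np.dtype(np.float32).itemsize  # 4 bytes per element

# Address of column 4: scale by rows * itemsize to get a *byte* offset.
col = 4
col_ptr = base + np.intp(col * rows * itemsize)
print(hex(col_ptr))  # 0x10000 + 4*399*4 = 0x118f0
```

The resulting integer can then be passed as a kernel argument wherever a device pointer is expected, which matches what the test above does.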
Thanks!
Drew
_______________________________________________
PyCUDA mailing list
[email protected]
http://tiker.net/mailman/listinfo/pycuda_tiker.net