On Wed, Nov 7, 2012 at 9:53 PM, Lev Givon <[email protected]> wrote:
> Received from Ahmed Fasih on Wed, Nov 07, 2012 at 09:02:15PM EST:
>> Hi folks, please take my continued stream of questions as an indication
>> of the versatility and flexibility of PyCUDA, rather than of my slowness
>> with it.
>>
>> I have PyCUDA code that allocates memory on the host and on the device
>> to store the output of a kernel calculation. E.g.,
>>
>> ### start code
>> import numpy
>> import pycuda.driver as cuda
>> import pycuda.autoinit
>> image = numpy.zeros((1024, 1024), dtype=numpy.complex64, order='C')
>> image_gpu = cuda.mem_alloc(image.nbytes)
>>
>> # kernel invocation using image_gpu works fine.
>> ### end code
>>
>> I'd like to replace this idiom with one involving page-locked memory,
>> specifically device-mapped memory. I have something as follows:
>>
>> ### start code
>> image = cuda.aligned_empty((1024, 1024), dtype=numpy.complex64,
>>                            order='C')
>> image_gpu = cuda.register_host_memory(image,
>>     flags=cuda.mem_host_register_flags.DEVICEMAP)
>> ### end code
>>
>> The same kernel invocation using image_gpu fails with
>> "pycuda._driver.LogicError: cuLaunchKernel failed: invalid value".
>>
>> I also tried passing the kernel the integer (pointer) obtained by
>> calling get_device_pointer() on the base of register_host_memory()'s
>> return value. In other words, I tried this:
>>
>> ### start code
>> image_gpu_return = cuda.register_host_memory(image,
>>     flags=cuda.mem_host_register_flags.DEVICEMAP)
>> image_gpu = image_gpu_return.base.get_device_pointer()
>>
>> # kernel invocation using image_gpu fails.
>> ### end code
>>
>> Note that it is the kernel invocation that throws the exception; the
>> aligned_empty, register_host_memory, and get_device_pointer calls are
>> all fine.
>>
>> Has anyone used this CUDA feature in PyCUDA who can shed some light on
>> how to do it right? Thanks,
>> Ahmed
>>
>> PS.
>> Perhaps a note on why I'm trying to use device-mapped memory: any
>> memory allocated this way will be written only once by the kernel, and
>> it might not fit entirely in GPU memory.
>
> Not sure why you are observing a failure; the following gists run
> without error on my system (CUDA 4.2.9, PyCUDA 2012.1, NVIDIA driver
> 295.40, 64-bit Linux):
>
> https://gist.github.com/4036297
> https://gist.github.com/4036292
>
> L.G.
Thanks Lev! These gists were really useful in understanding how to use
these functions, and they work for me too. Nonetheless, I tried and
succeeded in breaking the second one: see https://gist.github.com/4036693

First, I had to add "assert" to the calls to np.allclose to make sure I'd
be informed if things weren't all close. Then I extended the kernel to
work with multiple blocks, and finally I moved the unpinned test first.

As I increased N from 20 to 22, both tests passed. But at N=23 (a
23-by-23 array), although the unpinned version works, the pinned
assertion fails and PyCUDA complains that cleanup operations failed. I
can't find any documented limit on the size of page-locked memory
allocations, but surely it's larger than the ~4 KB of a 23-by-23
complex64 array, right?

Ubuntu 11.10, NVIDIA driver 304.51, CUDA 5, PyCUDA 2012.1, Tesla C2050.

If you or any other kind soul is able to successfully run this gist, let
me know! https://gist.github.com/4036693

Thanks again,
Ahmed

_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
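[Editor's aside: the sizes at which the gist reportedly starts failing are tiny, which is what makes the failure surprising. A complex64 element is 8 bytes, so the arrays in question are only a few kilobytes:]

```python
# Sizes of the arrays in the failing gist: N-by-N complex64, 8 bytes
# per element. N=22 passes, N=23 fails, per the report above.
import numpy as np

itemsize = np.dtype(np.complex64).itemsize  # 8 bytes
size_22 = 22 * 22 * itemsize
size_23 = 23 * 23 * itemsize

print(size_22)  # 3872 bytes (~3.8 KB): passes
print(size_23)  # 4232 bytes (~4.1 KB): fails
```

Both are far below any plausible pinned-memory limit, consistent with the suspicion that something other than allocation size is at fault.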
