Thanks Andreas, attached are a .cu file and a .py file that I've adapted from the linked NVIDIA document to try to run with PyCUDA. I can compile and run the .cu file using the command included in its header, but when I try "python hello.py", I get this error:
$ python hello.py
Traceback (most recent call last):
  File "hello.py", line 11, in <module>
    mod = SourceModule(custr, options=['-rdc=true', '-lcudadevrt'])
  File "/usr/local/lib/python2.7/dist-packages/pycuda-2012.1-py2.7-linux-x86_64.egg/pycuda/compiler.py", line 285, in __init__
    self.module = module_from_buffer(cubin)
pycuda._driver.LogicError: cuModuleLoadDataEx failed: invalid image -
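For what it's worth, here's the workaround I'm planning to try next (a sketch, not yet verified on the K20): SourceModule hands nvcc's cubin straight to cuModuleLoadDataEx, but code built with -rdc=true still needs a device-link step against cudadevrt, so loading the unlinked image could plausibly be what produces "invalid image". The idea is to do the full compile-and-link outside PyCUDA and load the result with pycuda.driver.module_from_file. The helper name and the exact nvcc flag combination below are my guesses, not anything from the docs:

```python
# Sketch of a possible workaround (unverified): compile to relocatable
# device code, device-link against cudadevrt with nvcc, then load the
# finished cubin instead of letting SourceModule drive the compile.
# The helper name and flag combination are illustrative assumptions.

def nvcc_dynamic_parallelism_cmds(cu_file, cubin_file, arch="sm_35"):
    """Build the two nvcc invocations: compile, then device-link."""
    obj_file = cu_file.rsplit(".", 1)[0] + ".o"
    compile_cmd = ["nvcc", "-arch=" + arch, "-rdc=true", "-c",
                   cu_file, "-o", obj_file]
    # -dlink with -cubin should ask nvcc for a device-linked cubin
    link_cmd = ["nvcc", "-arch=" + arch, "-dlink", obj_file,
                "-lcudadevrt", "-cubin", "-o", cubin_file]
    return [compile_cmd, link_cmd]

# Usage (needs nvcc and a CC 3.5 device, so the GPU part is commented out):
#   import subprocess
#   for cmd in nvcc_dynamic_parallelism_cmds("hello_world.cu", "hello.cubin"):
#       subprocess.check_call(cmd)
#   import pycuda.autoinit
#   from pycuda.driver import module_from_file
#   mod = module_from_file("hello.cubin")
#   mod.get_function("parentKernel")(block=(1, 1, 1), grid=(1, 1))
```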
The patch for the 4- vs. 8-byte shared-memory bank-size setting is forthcoming. Thanks!
Ahmed
On Thu, Feb 14, 2013 at 12:36 AM, Andreas Kloeckner
<[email protected]> wrote:
> Hi Ahmed,
>
> Ahmed Fasih <[email protected]> writes:
>> Hi folks, I write in the hope that someone has gotten a K20 Kepler 3.5
>> compute capability device and has gotten it to do dynamic parallelism,
>> wherein a kernel can kick off grids on its own without returning to
>> the CPU. A "hello world" example is given at [1], page 23. With the
>> suggested nvcc command, I was able to compile and run the hello world
>> perfectly, but when I tried to load the kernels into PyCUDA, I get
>> LogicErrors, asserting
>>
>> "cuModuleLoadDataEx failed: invalid image -"
>>
>> This happens both before and after I take out the runtime API calls
>> like cudaDeviceSynchronize() in the kernel code, and it persists after
>> I pass the "-rdc=true" and "-lcudadevrt" options into SourceModule.
>> Could PyCUDA be hating the runtime API calls in a kernel? If anybody's
>> gotten this to work, please share!
>>
>> On a related note, I've added support for the sm_35 (30?) feature of
>> choosing four- versus eight-byte banks for shared memory to PyCUDA;
>> I can send out a rough patch.
>>
>> [1]
>> http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
>>
>> For completeness, here's the original CUDA code from there:
>>
>> #include <stdio.h>
>>
>> __global__ void childKernel()
>> {
>>     printf("Hello ");
>> }
>>
>> __global__ void parentKernel()
>> {
>>     // launch child
>>     childKernel<<<1,1>>>();
>>     if (cudaSuccess != cudaGetLastError()) {
>>         return;
>>     }
>>     // wait for child to complete
>>     if (cudaSuccess != cudaDeviceSynchronize()) {
>>         return;
>>     }
>>     printf("World!\n");
>> }
>>
>> int main(int argc, char *argv[])
>> {
>>     // launch parent
>>     parentKernel<<<1,1>>>();
>>     if (cudaSuccess != cudaGetLastError()) {
>>         return 1;
>>     }
>>     // wait for parent to complete
>>     if (cudaSuccess != cudaDeviceSynchronize()) {
>>         return 2;
>>     }
>>     return 0;
>> }
>>
>>
>> And compiled via:
>> $ nvcc -arch=sm_35 -rdc=true hello_world.cu -o hello -lcudadevrt
>>
>> (copied & pasted from NVIDIA doc [1], all rights reserved by NVIDIA etc.)
>
> Can you send a runnable example that exhibits that failure? (And yes,
> please send that patch!) I've got access to a K20, but that currently
> has the weird problem of working fine under CL and refusing to be
> detected under CUDA...?
>
> Andreas
--
Ahmed Fasih
[email protected]
[email protected]
614 547 3323 (Google Voice)
Attachments: hello.py, hello_world.cu
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
