I've run into the same error when mixing PyCUDA with CUDA Runtime API-based functions. My workaround is to avoid pycuda.autoinit and do this instead:
import pycuda.driver as cuda
cuda.init()
from pycuda.tools import make_default_context
context = make_default_context()
device = context.get_device()
import atexit
atexit.register(context.detach)

The only difference here is that at exit we're detaching from the CUDA context instead of popping it, as pycuda.autoinit does. That gets rid of the error, although it's probably not the "correct" solution to the problem.

- bryan

On Mon, Apr 12, 2010 at 12:35 AM, Paul Northug <pnort...@gmail.com> wrote:
> (I am using cuda 3.0 on OS X 10.6.3, pycuda-0.94rc. The device is
> GeForce 9600M GT.)
>
> I would like to use cuBLAS on gpuarrays in pycuda. At the bottom is a
> test matrix-matrix multiply program. In addition to not knowing what
> I'm doing, I'm having the following problems:
>
> 1. The program runs correctly but terminates on exit with:
>
> terminate called after throwing an instance of 'cuda::error'
> what(): cuCtxPushCurrent failed: invalid value
>
> In the interoperability section 3.4 of the programming guide, context
> stack manipulation is listed as not interoperable. What does the error
> mean and how can I avoid it?
>
> 2. When I do sgemm(a, b, c) where a and b are gpuarrays, I am getting
> c = np.dot(b, a) instead of c = np.dot(a, b). Does gpuarray convert
> row-major format to something else (column?) in its internal
> representation? Or am I calling sgemm incorrectly?
>
> 3. Now that it's possible to interoperate to some extent, are there
> plans to add runtime features to pycuda?
>
> cuBLAS was about 5 times slower for small matrices (<100) and 4500
> times faster for larger matrices (>500) than numpy. Does that sound
> about right? If so, that's impressive. What are comparable ratios for
> newer cards and dgemm?
>
> Here is my code (one of my first). It depends on pystream.cublas, a
> ctypes wrapper (http://code.google.com/p/pystream/):
>
> import numpy as np
> from ctypes import *
> from pystream import cublas
> import pycuda.driver as cuda
> import pycuda.autoinit
> import pycuda.gpuarray as gpuarray
> from time import time as now
>
> # dims
> m, k, n = 10, 10, 10
>
> # host arrays
> a = np.random.randn(m, k).astype(np.float32)
> b = np.random.randn(k, n).astype(np.float32)
> c = np.empty((m, n), dtype=np.float32)
>
> # device arrays
> a_g = gpuarray.to_gpu(a)
> b_g = gpuarray.to_gpu(b)
> c_g = gpuarray.empty(c.shape, dtype=np.float32)
>
> # cast device pointers to ctypes pointers to float
> ap = cast(int(a_g.gpudata), POINTER(c_float))
> bp = cast(int(b_g.gpudata), POINTER(c_float))
> cp = cast(int(c_g.gpudata), POINTER(c_float))
>
> # iterations for timing
> t = 1000
>
> # call cublas
> cublas.cublasInit()
> tic = now()
> for i in range(t):
>     cublas.cublasSgemm('n', 'n', m, n, k, 1.,
>                        ap, k,
>                        bp, n, 0.,
>                        cp, n)
> toc = now() - tic
> cublas.cublasShutdown()
> c_g.get(c)
>
> print 'cublas'
> print '%d iter, %g s/iter' % (t, toc / t)
> print c
>
> # compare to numpy
> cn = np.empty_like(c)
> tic = now()
> for i in range(t):
>     # cn = np.dot(a, b) # this doesn't work
>     cn = np.dot(b, a)
> toc = now() - tic
>
> print 'numpy'
> print '%d iter, %g s/iter' % (t, toc / t)
> print cn
>
> print 'error ', ((c-cn)**2).sum()
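On question 2: cuBLAS follows the Fortran convention and expects column-major storage, while numpy and gpuarray are row-major by default. cuBLAS therefore reads each of your arrays as its transpose, which is why Sgemm('n', 'n', ...) effectively gives you np.dot(b, a). The usual fix is to swap the operands rather than transpose anything, since row-major C = A.B has the same memory layout as column-major C^T = B^T.A^T. A sketch against the same pystream wrapper (untested on my end):

# compute c = np.dot(a, b) with column-major cuBLAS by swapping operands
cublas.cublasSgemm('n', 'n', n, m, k, 1.,
                   bp, n,   # b read column-major is B^T: n x k, ld = n
                   ap, k, 0.,  # a read column-major is A^T: k x m, ld = k
                   cp, n)   # result C^T: n x m column-major = m x n row-major

With that call, c_g.get(c) should match np.dot(a, b) directly. Note that with m = k = n = 10 the shapes all coincide, so it's worth re-testing with distinct dimensions to convince yourself the leading dimensions are right.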