(I am using CUDA 3.0 on OS X 10.6.3 with pycuda-0.94rc. The device is a GeForce 9600M GT.)
I would like to use cuBLAS on GPUArrays in PyCUDA. At the bottom is a test
matrix-matrix multiply program. In addition to not knowing what I'm doing,
I'm having the following problems:

1. The program runs correctly but terminates on exit with:

       terminate called after throwing an instance of 'cuda::error'
       what(): cuCtxPushCurrent failed: invalid value

   In the interoperability section 3.4 of the programming guide, context
   stack manipulation is listed as not interoperable. What does the error
   mean and how can I avoid it? (A manual-context variant I could try in
   order to narrow this down is sketched at the end of this message.)

2. When I do sgemm(a, b, c) where a and b are GPUArrays, I am getting
   c = np.dot(b, a) instead of c = np.dot(a, b). Does GPUArray convert
   row-major format to something else (column-major?) in its internal
   representation, or am I calling sgemm incorrectly? (My guess about the
   cause, with a sketch of the corresponding call, is also at the end.)

3. Now that it's possible to interoperate to some extent, are there plans
   to add runtime-API features to PyCUDA?

Compared to numpy, cuBLAS was about 5 times slower for small matrices
(< 100) and about 4500 times faster for larger matrices (> 500). Does that
sound about right? If so, that's impressive. What are comparable ratios for
newer cards and for dgemm?

Here is my code (one of my first). It depends on pystream.cublas, a ctypes
wrapper (http://code.google.com/p/pystream/):

import numpy as np
from ctypes import *
from pystream import cublas
import pycuda.driver as cuda
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from time import time as now

# dims
m, k, n = 10, 10, 10

# host arrays
a = np.random.randn(m, k).astype(np.float32)
b = np.random.randn(k, n).astype(np.float32)
c = np.empty((m, n), dtype=np.float32)

# device arrays
a_g = gpuarray.to_gpu(a)
b_g = gpuarray.to_gpu(b)
c_g = gpuarray.empty(c.shape, dtype=np.float32)

# cast the device pointers to ctypes pointers to float for pystream.cublas
ap = cast(int(a_g.gpudata), POINTER(c_float))
bp = cast(int(b_g.gpudata), POINTER(c_float))
cp = cast(int(c_g.gpudata), POINTER(c_float))

# iterations for timing
t = 1000

# call cublas
cublas.cublasInit()
tic = now()
for i in range(t):
    cublas.cublasSgemm('n', 'n', m, n, k, 1., ap, k, bp, n, 0., cp, n)
toc = now() - tic
cublas.cublasShutdown()
c_g.get(c)

print 'cublas'
print '%d iter, %g s/iter' % (t, toc / t)
print c

# compare to numpy
cn = np.empty_like(c)
tic = now()
for i in range(t):
    # cn = np.dot(a, b)  # this doesn't work (doesn't match the cublas result)
    cn = np.dot(b, a)
toc = now() - tic

print 'numpy'
print '%d iter, %g s/iter' % (t, toc / t)
print cn
print 'error ', ((c - cn) ** 2).sum()
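Regarding question 1: to check whether the crash comes from pycuda.autoinit's
atexit teardown colliding with cublasShutdown (just a guess on my part), the
autoinit import could be replaced by explicit context management. This is only
a sketch, and the ordering of cublasShutdown relative to ctx.pop is an
assumption:

import pycuda.driver as cuda
from pystream import cublas

cuda.init()
dev = cuda.Device(0)
ctx = dev.make_context()       # create and push a context explicitly
try:
    cublas.cublasInit()
    # ... gpuarray allocations and cublasSgemm calls as in the program above ...
    cublas.cublasShutdown()    # shut CUBLAS down while the context is still current
finally:
    ctx.pop()                  # pop the context ourselves instead of relying on autoinit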
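Regarding question 2: my guess is that cuBLAS treats the flat device buffers
as column-major, so the row-major (m, k) array a is seen as a^T. Under that
assumption (which I have not confirmed), asking cuBLAS for c^T = b^T * a^T
should leave the row-major view of c equal to np.dot(a, b). A sketch of the
corresponding call, reusing the pointers from the program above:

# Assumption: cuBLAS reads the row-major buffers as column-major matrices,
# i.e. it sees a as a^T (k x m) and b as b^T (n x k).  Computing
# c^T = b^T * a^T then makes the row-major view of c equal to a.dot(b).
cublas.cublasSgemm('n', 'n',
                   n, m, k,      # dimensions of the transposed product (n x m)
                   1., bp, n,    # b^T is n x k, leading dimension n
                   ap, k,        # a^T is k x m, leading dimension k
                   0., cp, n)    # c^T is n x m, leading dimension n
c_g.get(c)
print 'error vs np.dot(a, b): ', ((c - np.dot(a, b)) ** 2).sum()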