(I am using CUDA 3.0 on OS X 10.6.3 with pycuda-0.94rc. The device is
a GeForce 9600M GT.)

I would like to use cuBLAS on gpuarrays in pycuda. At the bottom is a
test matrix-matrix multiply program. In addition to not knowing what
I'm doing, I'm having the following problems:

1. The program runs correctly but terminates on exit with:

terminate called after throwing an instance of 'cuda::error'
  what():  cuCtxPushCurrent failed: invalid value

In the interoperability section (3.4) of the programming guide,
context stack manipulation is listed as not interoperable. What does
the error mean, and how can I avoid it? (A sketch of the explicit
context handling I have in mind is below.)
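In case it matters, this is roughly what I mean by managing the
context explicitly instead of relying on pycuda.autoinit's atexit
cleanup. It is only a sketch of the ordering I have in mind (shut
down cuBLAS first, then pop the context); I have not verified that
it actually avoids the error:

import pycuda.driver as cuda
from pystream import cublas

cuda.init()
dev = cuda.Device(0)
ctx = dev.make_context()       # instead of importing pycuda.autoinit

cublas.cublasInit()
try:
    pass                       # gpuarray allocations and cublasSgemm calls go here
finally:
    cublas.cublasShutdown()    # tear down cuBLAS while the context is still current
    ctx.pop()                  # then pop the context before interpreter exit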

2. When I do sgemm(a, b, c) where a and b are gpuarrays, I get
c = np.dot(b, a) instead of c = np.dot(a, b). Does gpuarray convert
row-major (C-order) data to some other layout (column-major?) in its
internal representation, or am I calling sgemm incorrectly? My guess
at a column-major explanation is sketched below.
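If the explanation is simply that cuBLAS expects column-major
(Fortran-order) storage, then my row-major arrays are seen as their
transposes, and computing c^T = b^T * a^T should give c = np.dot(a, b).
Under that assumption (which I would like confirmed), the call inside
the timing loop below would become:

# swap the operands and dimensions so that cuBLAS computes
# c^T = b^T * a^T on the column-major view of the row-major data
cublas.cublasSgemm('n', 'n', n, m, k, 1.,
                   bp, n,
                   ap, k, 0.,
                   cp, n)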

3. Now that it's possible to interoperate to some extent, are there
plans to add runtime-API features to pycuda?

Compared to numpy, cuBLAS was about 5 times slower for small matrices
(< 100) and about 4500 times faster for larger matrices (> 500). Does
that sound about right? If so, that's impressive. What are comparable
ratios for newer cards and for dgemm? (A synchronized version of my
timing loop is sketched below, in case that affects the numbers.)
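If the cublasSgemm calls return before the GPU has finished (which I
believe is the case), I am assuming the timing loop should block on
the device before reading the clock. A sketch of what I mean, using
pycuda.driver's Context.synchronize(); I have not checked how much it
changes the numbers above:

# time t iterations, forcing the GPU to finish before stopping the clock;
# without the synchronize, the loop may mostly measure launch overhead
tic = now()
for i in range(t):
    cublas.cublasSgemm('n', 'n', m, n, k, 1.,
                       ap, k,
                       bp, n, 0.,
                       cp, n)
cuda.Context.synchronize()
toc = now() - tic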

Here is my code (one of my first). It depends on pystream.cublas, a
ctypes wrapper (http://code.google.com/p/pystream/):

import numpy as np
from ctypes import *
from pystream import cublas
import pycuda.driver as cuda
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from time import time as now

# dims
m, k, n = 10, 10, 10

# host
a = np.random.randn(m, k).astype(np.float32)
b = np.random.randn(k, n).astype(np.float32)
c = np.empty((m, n), dtype=np.float32)

# device
a_g = gpuarray.to_gpu(a)
b_g = gpuarray.to_gpu(b)
c_g = gpuarray.empty(c.shape, dtype=np.float32)

# cast to ctypes pointers to float
ap = cast(int(a_g.gpudata), POINTER(c_float))
bp = cast(int(b_g.gpudata), POINTER(c_float))
cp = cast(int(c_g.gpudata), POINTER(c_float))

# iterations for timing
t = 1000

# call cublas
cublas.cublasInit()
tic = now()
for i in range(t):
    cublas.cublasSgemm('n', 'n', m, n, k, 1.,
                       ap, k,
                       bp, n, 0.,
                       cp, n)
toc = now() - tic
cublas.cublasShutdown()
c_g.get(c)

print 'cublas'
print '%d iter, %g s/iter' % (t, toc / t)
print c

# compare to numpy
cn = np.empty_like(c)
tic = now()
for i in range(t):
    # cn = np.dot(a, b)   # this is what I want, but doesn't match c
    cn = np.dot(b, a)     # this matches the cuBLAS result above
toc = now() - tic

print 'numpy'
print '%d iter, %g s/iter' % (t, toc / t)
print cn

print 'error ', ((c-cn)**2).sum()
