Hi,
I've been using PyCUDA on a GTX 295 (not used for display, under Linux),
and of course I need to use the two devices associated with that card in
parallel. With recent versions of PyCUDA this works great using threading.
However I have run into several problems:
1) I am running many (> 10000) runs of quick kernels (each lasting a few
seconds at most). The problem is that since a context is thread-specific, I
needed to re-create the context and re-compile the kernel many times, which
takes a significant amount of time when kernel execution is fast (< 0.5 s).
2) I had strange errors - several times, after more than 32000 kernel
executions (and the same number of context creations, kernel compilations and
threads), context creation would fail, without any clear reason.
I'm not sure yet what to do about (2) - it could come from the repeated
context creations and kernel compilations, the memory transfers, or my unusual
setup (CUDA 2.2 with the old cuda.so driver) - so let's set that aside for now
(I tried launching 40000 simple kernel executions with threads using PyCUDA
and it worked fine, so I don't think anything is wrong on PyCUDA's side).
But to eliminate the multiple context creations and kernel compilations, I
have now tried another approach: I create one thread per device, create the
context and compile the kernel only once, and then feed data to the threads
as many times as needed using Events.
I've attached the code with a dummy kernel as an example. It looks somewhat
kludgy, and it could certainly be written in a much better way using
multiprocessing (Process and Queue), but this way it works with Python < 2.6.
One thing I haven't figured out entirely is how to automatically delete the
threads when they are no longer needed.
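Stripped of all the CUDA parts, the Event handshake I'm using boils down to
the sketch below (names are made up for the example, and a toy computation
stands in for the kernel launch):

```python
import threading

class Worker(threading.Thread):
    # A long-lived worker: expensive setup happens once in run(),
    # then jobs are handed over through a pair of Events.
    def __init__(self):
        threading.Thread.__init__(self)
        self.job_ready = threading.Event()   # main -> worker: data is set
        self.job_done = threading.Event()    # worker -> main: result is set
        self.join_flag = False
        self.data = None
        self.result = None

    def run(self):
        # (in the real code: create the context and compile the kernel here)
        while True:
            self.job_ready.wait()
            if self.join_flag:
                break
            self.result = self.data * 2      # stand-in for the kernel launch
            self.job_ready.clear()
            self.job_done.set()

w = Worker()
w.start()
results = []
for x in range(3):
    w.data = x
    w.job_done.clear()
    w.job_ready.set()
    w.job_done.wait()
    results.append(w.result)
# shut the worker down: raise the flag, then wake the thread one last time
w.join_flag = True
w.job_ready.set()
w.join()
print(results)   # [0, 2, 4]
```

The important ordering is that the main thread clears job_done *before*
setting job_ready, so it cannot miss the worker's completion signal.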
I'll know tomorrow whether it helps with problem (2) - if my calculations are
still running - but this definitely sped things up a lot for short kernels.
Any comments on the code? I'll try to rewrite this using multiprocessing.
Vincent
--
Vincent Favre-Nicolin
CEA/Grenoble Institut Nanosciences & Cryogénie -
http://inac.cea.fr/
Université Joseph Fourier - http://physique-eea.ujf-grenoble.fr/
tel: (+33) 4 38 78 95 40 fax: (+33) 4 38 78 51 38
# -*- coding: utf-8 -*-
import pycuda.driver as drv
import pycuda.compiler as compiler
import numpy
import time
import threading
drv.init()
assert drv.Device.count() >= 1
mod_test_str = """
__global__ void CUDA_test(const float *a, const float *b, float *result, const int nb)
{
   // This is a dummy calculation, just for tests !
   // nb is passed as numpy.int32 from the host, so it must be int here
   // (long is 8 bytes on 64-bit Linux and would not match).
   #define BLOCKSIZE 32
   const unsigned long ix = threadIdx.x + blockDim.x * blockIdx.x;
   const float ai = a[ix];
   float s = 0;
   __shared__ float btmp[BLOCKSIZE];
   for (int i = 0; i < nb; i += BLOCKSIZE)
   {
      btmp[threadIdx.x] = b[i + threadIdx.x];
      __syncthreads();
      for (unsigned int j = 0; j < BLOCKSIZE; j++)
      {
         s += ai * btmp[j];
      }
      // make sure all threads are done reading btmp before the
      // next iteration overwrites it
      __syncthreads();
   }
   result[ix] = s;
}
"""
class GPUThread(threading.Thread):
    def __init__(self, devID):
        threading.Thread.__init__(self)
        assert drv.Device.count() >= devID + 1
        self.devID = devID
        self.eventStart = threading.Event()
        self.eventFinished = threading.Event()
        self.join_flag = False
        self.verbose = True

    def run(self):
        # The context is created and the kernel compiled only once,
        # inside the thread that will use them.
        dev = drv.Device(self.devID)
        ctx = dev.make_context()
        mod_test = compiler.SourceModule(mod_test_str, options=["-use_fast_math"])
        CUDA_test = mod_test.get_function("CUDA_test")
        if self.verbose: print self.name, " ...beginning"
        while True:
            self.eventStart.wait()
            if self.join_flag: break
            if self.verbose: print self.name, " ...got a job !"
            CUDA_test(drv.In(self.a), drv.In(self.b), drv.InOut(self.c),
                      numpy.int32(self.N), block=(32, 1, 1), grid=(int(self.N / 32), 1))
            if self.verbose: print self.name, " ...finished job !"
            self.eventStart.clear()
            self.eventFinished.set()
        print self.name, " ...exiting"
        ctx.pop()
# Do this on one card only
gpu_name="GTX 295"
gpu_devices=[]
for i in xrange(drv.Device.count()):
    if drv.Device(i).name().find(gpu_name) >= 0:
        gpu_devices.append(i)
nbthread=len(gpu_devices)
if nbthread == 0:
    raise Exception("No device found: is the name for your GPU device (%s) correct?" % gpu_name)
threads=[]
time.sleep(0.1)
for i in xrange(nbthread):
    threads.append(GPUThread(gpu_devices[i]))
    threads[-1].start()
# Give some work
t0=time.time()
N=2**18
nbiter=20
a=numpy.random.uniform(-1,1,N).astype(numpy.float32)
b=numpy.random.uniform(-1,1,N).astype(numpy.float32)
c=a*0
for i in xrange(nbiter):
    for j in xrange(nbthread):
        print i, j
        threads[j].N = N
        threads[j].a = a
        threads[j].b = b
        threads[j].c = c
        threads[j].eventFinished.clear()
        threads[j].eventStart.set()
    # wait for all threads before dispatching the next batch
    # (the original loop reused the stale index j here by mistake)
    for j in xrange(nbthread):
        threads[j].eventFinished.wait()
for j in xrange(nbthread):
    print "Finishing thread #", j
    threads[j].join_flag = True
    threads[j].eventStart.set()
    threads[j].join()
dt=time.time()-t0
print "dt=%6.3f, GFlops=%f"%(dt,N*N*2.0/dt/1e9*nbiter*nbthread)
_______________________________________________
PyCuda mailing list
[email protected]
http://tiker.net/mailman/listinfo/pycuda_tiker.net