Hi,
I've been using PyCUDA on a GTX 295 (not used for display, under Linux),
and of course I need to use the two devices associated with that card in
parallel. With recent versions of PyCUDA this works great using threading.
However I have run into several problems:
1) I am running many (> 10000) runs of quick kernels (each lasting a few
seconds at most). The problem is that since a context is thread-specific, I
needed to re-create the context and re-compile the kernel many times, which
takes a significant amount of time when kernel execution is fast (< 0.5 s).
2) I had strange errors - several times, after more than 32000 kernel
executions (and the same number of context creations, kernel compilations and
threads), context creation would fail, without any clear reason.
I'm not sure yet what to do about (2) - it could come from the repeated
context creations and kernel compilations, the memory transfers, or my unusual
setup (CUDA 2.2 with the old cuda.so driver) - so let's set that aside for now
(I tried launching 40000 simple kernel executions with threads using PyCUDA
and it worked fine, so I don't think anything is wrong on PyCUDA's side).
But to eliminate the multiple context creations and kernel compilations, I
have now tried another approach: I create one thread per device, create the
context and compile the kernel only once, and then feed data to the threads
as many times as needed using Events.
I've attached the code with a dummy kernel as an example. It looks somewhat
kludgy, and it could certainly be written in a much better way using
multiprocessing (Process and Queue), but this way it works with Python < 2.6.
One thing I haven't figured out entirely is how to automatically delete the
threads when they are no longer needed.
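Stripped of all the CUDA parts, the Event handshake I'm using boils down to
the sketch below (names are made up for the example, and a toy computation
stands in for the kernel launch):

```python
import threading

class Worker(threading.Thread):
    # A long-lived worker: expensive setup happens once in run(),
    # then jobs are handed over through a pair of Events.
    def __init__(self):
        threading.Thread.__init__(self)
        self.job_ready = threading.Event()   # main -> worker: data is set
        self.job_done = threading.Event()    # worker -> main: result is set
        self.join_flag = False
        self.data = None
        self.result = None

    def run(self):
        # (in the real code: create the context and compile the kernel here)
        while True:
            self.job_ready.wait()
            if self.join_flag:
                break
            self.result = self.data * 2      # stand-in for the kernel launch
            self.job_ready.clear()
            self.job_done.set()

w = Worker()
w.start()
results = []
for x in range(3):
    w.data = x
    w.job_done.clear()
    w.job_ready.set()
    w.job_done.wait()
    results.append(w.result)
# shut the worker down: raise the flag, then wake the thread one last time
w.join_flag = True
w.job_ready.set()
w.join()
print(results)   # [0, 2, 4]
```

The important ordering is that the main thread clears job_done *before*
setting job_ready, so it cannot miss the worker's completion signal.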
I'll know tomorrow whether it helps with problem (2) - if my calculations are
still running - but this definitely sped things up a lot for short kernels.
Any comments on the code? I'll try to rewrite this using multiprocessing.
Vincent
--
Vincent Favre-Nicolin
CEA/Grenoble Institut Nanosciences & Cryogénie -
http://inac.cea.fr/
Université Joseph Fourier - http://physique-eea.ujf-grenoble.fr/
tel: (+33) 4 38 78 95 40 fax: (+33) 4 38 78 51 38
# -*- coding: utf-8 -*-
import pycuda.driver as drv
import pycuda.compiler as compiler
import numpy
import time
import threading
drv.init()
assert drv.Device.count() >= 1
mod_test_str = """
__global__ void CUDA_test(const float *a, const float *b, float *result, const int nb)
{
   // This is a dummy calculation, just for tests !
   // nb is passed as numpy.int32 from the host, so it must be int here
   // (long is 8 bytes on 64-bit Linux and would not match).
   #define BLOCKSIZE 32
   const unsigned long ix = threadIdx.x + blockDim.x * blockIdx.x;
   const float ai = a[ix];
   float s = 0;
   __shared__ float btmp[BLOCKSIZE];
   for (int i = 0; i < nb; i += BLOCKSIZE)
   {
      btmp[threadIdx.x] = b[i + threadIdx.x];
      __syncthreads();
      for (unsigned int j = 0; j < BLOCKSIZE; j++)
      {
         s += ai * btmp[j];
      }
      // make sure all threads are done reading btmp before the
      // next iteration overwrites it
      __syncthreads();
   }
   result[ix] = s;
}
"""
class GPUThread(threading.Thread):
    def __init__(self, devID):
        threading.Thread.__init__(self)
        assert drv.Device.count() >= devID + 1
        self.devID = devID
        self.eventStart = threading.Event()
        self.eventFinished = threading.Event()
        self.join_flag = False
        self.verbose = True

    def run(self):
        # The context is created and the kernel compiled only once,
        # inside the thread that will use them.
        dev = drv.Device(self.devID)
        ctx = dev.make_context()
        mod_test = compiler.SourceModule(mod_test_str, options=["-use_fast_math"])
        CUDA_test = mod_test.get_function("CUDA_test")
        if self.verbose: print self.name, " ...beginning"
        while True:
            self.eventStart.wait()
            if self.join_flag: break
            if self.verbose: print self.name, " ...got a job !"
            CUDA_test(drv.In(self.a), drv.In(self.b), drv.InOut(self.c),
                      numpy.int32(self.N), block=(32, 1, 1), grid=(int(self.N / 32), 1))
            if self.verbose: print self.name, " ...finished job !"
            self.eventStart.clear()
            self.eventFinished.set()
        print self.name, " ...exiting"
        ctx.pop()
# Do this on one card only
gpu_name="GTX 295"
gpu_devices=[]
for i in xrange(drv.Device.count()):
    if drv.Device(i).name().find(gpu_name) >= 0:
        gpu_devices.append(i)
nbthread=len(gpu_devices)
if nbthread == 0:
    raise Exception("No device found: is the name for your GPU device (%s) correct?" % gpu_name)
threads=[]
time.sleep(0.1)
for i in xrange(nbthread):
    threads.append(GPUThread(gpu_devices[i]))
    threads[-1].start()
# Give some work
t0=time.time()
N=2**18
nbiter=20
a=numpy.random.uniform(-1,1,N).astype(numpy.float32)
b=numpy.random.uniform(-1,1,N).astype(numpy.float32)
c=a*0
for i in xrange(nbiter):
    for j in xrange(nbthread):
        print i, j
        threads[j].N = N
        threads[j].a = a
        threads[j].b = b
        threads[j].c = c
        threads[j].eventFinished.clear()
        threads[j].eventStart.set()
    # wait for all threads before dispatching the next batch
    # (the original loop reused the stale index j here by mistake)
    for j in xrange(nbthread):
        threads[j].eventFinished.wait()
for j in xrange(nbthread):
    print "Finishing thread #", j
    threads[j].join_flag = True
    threads[j].eventStart.set()
    threads[j].join()
dt=time.time()-t0
print "dt=%6.3f, GFlops=%f"%(dt,N*N*2.0/dt/1e9*nbiter*nbthread)
_______________________________________________
PyCuda mailing list
[email protected]
http://tiker.net/mailman/listinfo/pycuda_tiker.net