On Tuesday 12 May 2009, Vincent Favre-Nicolin wrote:
> Hi,
>
> I've been using PyCUDA on a GTX 295 (not used for display, under Linux),
> and of course I need to use the two devices associated with that card in
> parallel. With recent versions of PyCUDA this works great using threading.
> However, I have run into several problems:
>
> 1) I am running many (>10000) runs of quick kernels (each lasts up to a
> few seconds at most). The problem is that, as the context is thread-specific,
> I needed to re-create the context and re-compile the kernel many times,
> which takes a significant amount of time when kernel execution is fast
> (<0.5 s).
>
> 2) I had strange errors - several times, after more than 32000 kernel
> executions (and the same number of context creations + kernel compilations +
> threads), the context creation would fail, without any clear reason.
Check the temperature of your devices. I've seen weird behavior out of the cards if they're insufficiently cooled. What error message do you see?

> I'm not sure yet what to do about (2) - it could come from the repeated
> context creations and kernel compilations, memory transfers, or my strange
> setup (using CUDA 2.2 with the old cuda.so driver) - so let's forget about
> that (I tried launching 40000 simple kernel executions with threads using
> PyCUDA and it worked fine, so I don't think anything is wrong on PyCUDA's
> side).

If you're only using one GPU, there's much less heat to dissipate. (Just a theory, though.)

> But in order to eliminate the multiple context creations and kernel
> compilations, I just tried another approach: I am now creating threads (one
> for each of my devices), creating the context and compiling the kernel only
> once, and then supplying data as many times as I need to the threads using
> Events.

That seems like a wise thing to do.

> I've attached the code with a dummy kernel as an example. It looks
> somewhat kludgy, and it could certainly be written in a much better way
> using multiprocessing (Process and Queue), but that way it works with
> Python < 2.6.

There's a backport of multiprocessing for 2.5 and below.

In general, processes are much nicer than threads for PyCUDA. GC makes it a bit difficult to predict in which thread a destructor gets called, and CUDA gets upset if you try to clean up stuff from the wrong context. PyCUDA is aware of this issue and works around it, but if a destructor gets called in the wrong thread/context, there's little it can do - in that case, it will leak the object and emit a warning. If you keep all references to CUDA objects inside each context-associated thread and make sure you don't have reference cycles (or call gc.collect() if you do), you should be OK, though.

> One thing I've not figured out entirely correctly is automatically
> deleting the threads when they are no longer needed.
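As an aside, the long-lived one-worker-thread-per-device pattern described above can be sketched as follows. This is only a structural sketch: the PyCUDA context/kernel setup is indicated in comments, and a trivial squaring stands in for the real kernel launch (a Queue is used here instead of Events, but the idea - set up once, feed data many times - is the same).

```python
import threading
import queue

def gpu_worker(device_id, tasks, results):
    # With PyCUDA, the per-thread context and kernel would be created here,
    # exactly once per device, e.g.:
    #   import pycuda.driver as cuda
    #   cuda.init()
    #   ctx = cuda.Device(device_id).make_context()
    #   mod = SourceModule(kernel_src)   # compile once, reuse for every task
    try:
        while True:
            item = tasks.get()
            if item is None:          # sentinel: shut this worker down cleanly
                break
            # Real code would launch the compiled kernel here; we square
            # the input as a placeholder.
            results.put((device_id, item * item))
    finally:
        pass  # with PyCUDA: ctx.pop() / ctx.detach() in the owning thread

tasks, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=gpu_worker, args=(dev, tasks, results))
           for dev in (0, 1)]         # e.g. the two halves of a GTX 295
for w in workers:
    w.start()
for x in range(4):                    # supply data as many times as needed
    tasks.put(x)
for _ in workers:                     # one sentinel per worker ends it
    tasks.put(None)
for w in workers:
    w.join()
out = sorted(r for _, r in (results.get() for _ in range(4)))
print(out)   # [0, 1, 4, 9]
```

The sentinel-per-worker trick also answers the thread-deletion question: each worker exits its loop on a None and can then be join()ed.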
What's the problem here?

> Any comment on the code? I'll try to write this using multiprocessing.

Looks good at first glance.

Andreas
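For the multiprocessing rewrite mentioned above, a minimal sketch with Process and Queue might look like the following. Each process owns its device outright, which sidesteps the cross-thread destructor problem entirely; the PyCUDA setup is again only indicated in comments, and `item + 1` is a placeholder for the real kernel.

```python
import multiprocessing as mp

def gpu_process(device_id, tasks, results):
    # Per-process PyCUDA setup would go here (cuda.init(), make_context(),
    # SourceModule compilation) and runs exactly once per device.
    while True:
        item = tasks.get()
        if item is None:                      # sentinel ends this process
            break
        results.put((device_id, item + 1))   # placeholder for the kernel

def run():
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=gpu_process, args=(dev, tasks, results))
             for dev in (0, 1)]               # one process per device
    for p in procs:
        p.start()
    for x in range(4):
        tasks.put(x)
    for _ in procs:                           # one sentinel per process
        tasks.put(None)
    vals = sorted(results.get()[1] for _ in range(4))
    for p in procs:
        p.join()
    return vals

if __name__ == "__main__":
    print(run())   # [1, 2, 3, 4]
```

Since each context lives and dies inside a single process, PyCUDA's destructors always run in the right place, and no gc.collect() bookkeeping is needed.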
_______________________________________________
PyCuda mailing list
[email protected]
http://tiker.net/mailman/listinfo/pycuda_tiker.net
