Hello,

I have a question about using multiple GPU devices.  I finished my first 
PyCUDA application today.  PyCUDA's ElementwiseKernel is an excellent 
simplification of GPU programming: it is much, much nicer than fiddling with 
the memory hierarchy inside the GPU device, and it lets me focus my 
development effort on my application's logic, its parallel decomposition of 
work, and its internal synchronization.
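
For reference, here is the kind of usage I mean (a minimal, illustrative 
example rather than my actual application code):

import numpy as np
import pycuda.autoinit                      # picks one device, creates the context
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

# z = a*x + y over every element, with no explicit grid/block bookkeeping
axpy = ElementwiseKernel(
    "float a, float *x, float *y, float *z",
    "z[i] = a * x[i] + y[i]",
    "axpy")

x = gpuarray.to_gpu(np.random.randn(1024).astype(np.float32))
y = gpuarray.to_gpu(np.random.randn(1024).astype(np.float32))
z = gpuarray.empty_like(x)
axpy(np.float32(2.0), x, y, z)              # runs on the device autoinit chose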

That said, how could my application use *both* of the GPU devices installed 
in my workstation (or server)?  I am using a collection of GPU devices in the 
larger context of research on parallelizing application software for both the 
CPU and the GPU, and I have already built up an inventory of applications 
that parallelize well on CPUs.  It would therefore make a lot of sense to 
partition the independent parts of my applications' computations and send one 
work partition to each GPU device.  The partial results would be computed at 
the same time on my two CUDA devices and then returned to the host 
application for final aggregation, either after a parallel barrier or after a 
sequence of blocking join calls, one per CUDA device.
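
To make the partitioning concrete, the host-side split I have in mind is 
nothing fancier than slicing the input array into one contiguous piece per 
device and concatenating the partial results afterward (an illustrative 
sketch only; the first/last names match the pseudocode further down):

import numpy as np

a = np.random.randn(1_000_000).astype(np.float32)

n_devices = 2
bounds = np.linspace(0, len(a), n_devices + 1, dtype=int)
first, last = bounds[:-1], bounds[1:]   # first[d]:last[d] is device d's slice

parts = [a[first[d]:last[d]] for d in range(n_devices)]
# ... each part would be processed on its own GPU device ...
result = np.concatenate(parts)          # final aggregation on the host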

Here's the thing: the current PyCUDA API seems designed for any given 
application to use just one CUDA device at a time.  It looks at the 
CUDA_DEVICE environment variable, or at a special file on disk, to decide 
which GPU device to send the work to.  I am having a hard time thinking of a 
way to use such an API to drive both GPU devices in a reliable, predictable 
manner.  I would love to see existing work others have done in this regard, 
if you would like to share it with us on the mailing list.
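
My best guess so far at how this might be done with the existing API, after 
reading through pycuda.driver, is to create one context per device, each 
owned by its own host thread.  I have not actually tried this, so please 
treat it as an untested sketch rather than a working recipe (the square 
kernel is just a stand-in for my real elementwise operation):

import threading
import numpy as np
import pycuda.driver as drv                 # note: no pycuda.autoinit here
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

drv.init()                                  # initialize the driver API once, in the main thread

a = np.random.randn(2_000_000).astype(np.float32)
chunks = np.array_split(a, drv.Device.count())
results = [None] * len(chunks)

def worker(device_index, chunk):
    ctx = drv.Device(device_index).make_context()   # one context per device
    try:
        # build the kernel inside this thread so it is compiled for this context
        square = ElementwiseKernel("float *x, float *y",
                                   "y[i] = x[i] * x[i]",
                                   "square")
        x = gpuarray.to_gpu(chunk)
        y = gpuarray.empty_like(x)
        square(x, y)
        results[device_index] = y.get()     # copy this device's partial result back
        del x, y                            # free device memory while the context is current
    finally:
        ctx.pop()                           # release the context from this thread

threads = [threading.Thread(target=worker, args=(i, c))
           for i, c in enumerate(chunks)]
for t in threads:
    t.start()           # nonblocking: both devices now compute concurrently
for t in threads:
    t.join()            # blocking: one join per device, as described above

combined = np.concatenate(results)          # final aggregation on the host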

However, I could also imagine, in the near future, an improvement or 
extension of the PyCUDA API where application code might resemble something 
like this fictional sketch:

...
d0 = pycuda.autoinit.context.get_device(0)
d1 = pycuda.autoinit.context.get_device(1)
ga0 = gpuarray.to_gpu_async(d0, a[first[0]:last[0]])  # start copying data to d0
ga1 = gpuarray.to_gpu_async(d1, a[first[1]:last[1]])  # start copying other data to d1
fn = ElementwiseKernel(...)   # compile the kernel into GPU code
ga0.waitcomplete()            # wait for the async array copy to d0 to complete
fn.start(d0, colsb, rowsb, ga0, gb, gc0)   # start the computation on d0 using d0's data
ga1.waitcomplete()            # wait for the async array copy to d1 to complete
fn.start(d1, colsb, rowsb, ga1, gb, gc1)   # start the computation on d1 using d1's data
fn.join(d0)                   # wait for d0 to complete its computation
fn.join(d1)                   # wait for d1 to complete its computation
CombineResults(gc0, gc1)
...

The pseudocode above resembles the pthreads API, as you may have noticed.  
The join() calls block; the start() calls do not.  It's not necessary to use 
this particular syntax; it's just convenient for this example.  Other 
concurrency models would of course work just as well, such as 
OpenMP/OpenACC-style directives or message passing (send/receive).

To anyone who may have ideas on this topic:  What are the prospects for 
getting PyCUDA to use multiple GPU devices concurrently from a Python 
application?  Is it possible to achieve similar results with the existing 
API, that is, has anyone *actually tried to do it already*?

Again, PyCUDA is very nice to use.  I expect PyCUDA could draw many more 
people into GPU programming than the bone-stock NVIDIA CUDA API would (it did 
for me!).  Many parallel programmers probably don't want to fiddle with 
vendor-specific memory hierarchies: concurrent application programming is 
quite difficult enough without all that additional hardware-specific code to 
write.  That's true for me, at least.  Pthreads is well known to concurrent 
programmers, so a pthreads-like interface is one natural way to do this.


Thank you for reading.

Geoffrey
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
