Freddie:  

Thanks for the ideas about using MPI.  Luckily, I am already comfortable with
MPI, and what you said makes a lot of sense.  First and foremost, thanks for
the great idea -- I had not thought of it until you suggested it, and I may
well use it!

Your comments made me realize that the basis of any workaround for the current
PyCUDA API's inability to access multiple GPU devices within one process is to
use more than one Python process in the application.  To some people it may
sound strange to say that one application can use multiple processes, but that
is actually a quite natural idea for those of us who are comfortable with
parallel programming.  MPI is one good mechanism among many for
multiple-process applications; pthreads, OpenMP, and MPI are all well worth
knowing and using, in my opinion.  Python's built-in 'multiprocessing' module
is another good mechanism for writing a multiple-process application.  I have
used both MPI and multiprocessing with great results in the past.  My MPI
experience is mainly in C, but I know that Python MPI wrapper modules (such as
mpi4py) exist.
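
To make the one-process-per-device idea concrete, here is a minimal sketch
using multiprocessing.  It is an illustration only: the element-wise kernel,
the two-device count, and the chunk splitting are placeholders I made up, not
part of my real application.

# Minimal sketch (illustration only): one Python process per GPU device.
# The kernel, chunk sizes, and device count below are made-up placeholders.
import multiprocessing
import numpy as np

def worker(device_id, chunk):
    # Import PyCUDA inside the child so each process owns its own CUDA state
    # (the parent never touches CUDA, so forking worker processes is safe).
    import pycuda.driver as cuda
    import pycuda.gpuarray as gpuarray
    from pycuda.elementwise import ElementwiseKernel

    cuda.init()
    ctx = cuda.Device(device_id).make_context()
    try:
        double_it = ElementwiseKernel("float *x", "x[i] = 2.0f * x[i]")
        x = gpuarray.to_gpu(chunk)
        double_it(x)
        return x.get()
    finally:
        ctx.pop()

if __name__ == "__main__":
    data = np.arange(1000000, dtype=np.float32)
    n_gpus = 2                                  # assumed device count
    chunks = np.array_split(data, n_gpus)

    pool = multiprocessing.Pool(processes=n_gpus)
    results = pool.starmap(worker, enumerate(chunks))   # (device_id, chunk) pairs
    pool.close()
    pool.join()

    out = np.concatenate(results)

Each worker makes and then pops its own context, so all CUDA state stays
private to that process.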


I will also add, however, that there are problems with the MPI approach.  The
important thing is that my program is sequential, while MPI is intended for
*parallel* programs.  I am happy to write parallel programs, but I generally
don't do it without a good reason, because even a simple parallel program adds
a lot of complexity.  My application is intended to be a PyCUDA learning tool
that shows people how to access multiple GPUs from one Python application
using the simplest means available.  In theory, my current Python application
could remain -- in CUDA terms -- a purely sequential host program, feeding
chunks of work to the multiple GPU devices on my host.  Of course, the GPU
devices would be processing those chunks in parallel!  But the host code,
written as purely sequential Python, could keep collecting results and feeding
new chunks to the CUDA devices until all the application data has been
processed.  The nice thing is that the host code would still be sequential and
therefore relatively easy for other people to understand without any special
parallel-programming skills.  BUT, this is theory only, not the real world.
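
Just to show the shape I mean, here is a rough sketch of that sequential host
loop.  submit_chunk() and collect_result() are purely hypothetical
placeholders (here they just compute on the CPU) standing in for whatever
per-device submission mechanism would be needed -- this is not working PyCUDA
code, only the shape of the host logic.

# Hypothetical sketch of the *shape* of the sequential host loop only.
# submit_chunk() and collect_result() are made-up placeholders; here they
# simply compute on the CPU so the control flow can be run and read.
import numpy as np

def submit_chunk(device_id, chunk):
    # Placeholder: pretend device `device_id` doubled the chunk.
    return 2.0 * chunk

def collect_result(handle):
    # Placeholder: the "handle" already is the finished result here.
    return handle

def run_sequential_host(chunks, device_ids):
    results = []
    in_flight = {}                     # device_id -> outstanding work handle
    work = iter(chunks)

    # Prime each device with one chunk of work.
    for dev in device_ids:
        chunk = next(work, None)
        if chunk is not None:
            in_flight[dev] = submit_chunk(dev, chunk)

    # Plain sequential loop: collect a result, hand out the next chunk,
    # repeat until all the application data has been processed.
    while in_flight:
        for dev in list(in_flight):
            results.append(collect_result(in_flight.pop(dev)))
            chunk = next(work, None)
            if chunk is not None:
                in_flight[dev] = submit_chunk(dev, chunk)

    return results

data = np.arange(16, dtype=np.float32)
print(run_sequential_host(np.array_split(data, 6), device_ids=[0, 1]))

No threads, no MPI -- just a loop a beginner can read.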


Now, practically speaking, in the real world, you are still correct, Freddie:
my plan for purely sequential code breaks down for me today, because PyCUDA
simply cannot have one Python process talking to many GPU devices within the
same Python process.


PyCUDA in its current incarnation seems to be designed to require a
one-process-to-one-device mapping.  Therefore I expect that, at the end of the
day, my application will be forced to spawn multiple processes, one Python
process per CUDA device.  In my case, with many GPU devices, that means many
Python processes will need to run.  Alright, I can certainly do it -- parallel
programming is actually my thing -- but I do not like reaching for
parallel-programming platforms merely to work around a PyCUDA API limitation,
and unfortunately that is the only reason for MPI in this case.

Additionally, the particular machine I happen to be using is a multi-core
shared-memory system.  What about MPI?  The Message Passing Interface is
intended for distributed-memory applications.  (The application developer also
needs to learn to write parallel code with message passing, which is very
different from shared-memory parallel programming with pthreads or OpenMP.)
Yes, I happily admit that MPI also works pretty well on shared memory, and
that is to MPI's credit, but it is not the best fit here.  OpenMP and
pthreads, on the other hand, are designed precisely for shared-memory
high-performance applications, such as my particular host.  And what about
Python's multiprocessing module?  Multiprocessing is a strange hybrid:
because separate processes are running, the application must be designed as if
for distributed memory, yet all of those processes can only run on a single
shared-memory machine.  OK, I have done all of these things before, and I can
do them again, but none of them is ideal.  Ideally we would match a
shared-memory parallel application with a shared-memory machine architecture,
or a distributed-memory parallel application with a distributed-memory machine
architecture (or hybrid with hybrid hardware, and GPU devices can be added to
any of these for a few more combinations of complicated application designs).
At the end of the day, we make do with whatever we have.


MPI or the multiprocessing module will almost certainly be the tool I choose
for controlling and communicating between the processes in my Python
application.  My sadness comes from the extra complication these packages will
add to my essentially sequential application, when I know it is not strictly
needed except for today's PyCUDA API limitation.  The set of Python
programmers who (A) understand how to write parallel programs, intersected
with (B) those who know how to write CUDA programs -- even with PyCUDA
simplifying CUDA device programming somewhat -- is going to be a pretty small
number of people!  You and me, man!  But who else can do both MPI and PyCUDA
together?  Easier is always better, and that is supposedly what Python is
good for.
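
For the MPI route, here is a hedged sketch of what I imagine the
rank-per-device layout looking like with mpi4py (the wrapper you mentioned).
Again, the kernel and the scatter/gather of NumPy chunks are placeholders of
my own, not working application code.

# Hedged sketch: one MPI rank per GPU device, launched for example with
#   mpiexec -n <number_of_gpus> python this_script.py
# The kernel and the data below are made-up placeholders.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Rank 0 plays the "sequential host" role: split the data and scatter
# one chunk to every rank (including rank 0 itself).
if rank == 0:
    data = np.arange(1000000, dtype=np.float32)
    chunks = np.array_split(data, size)
else:
    chunks = None
chunk = comm.scatter(chunks, root=0)

# Each rank owns exactly one device, matching PyCUDA's 1-process/1-device model.
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray
from pycuda.elementwise import ElementwiseKernel

cuda.init()
ctx = cuda.Device(rank % cuda.Device.count()).make_context()
try:
    double_it = ElementwiseKernel("float *x", "x[i] = 2.0f * x[i]")
    x = gpuarray.to_gpu(chunk)
    double_it(x)
    result = x.get()
finally:
    ctx.pop()

# Gather the per-rank results back on rank 0 and reassemble them in order.
results = comm.gather(result, root=0)
if rank == 0:
    out = np.concatenate(results)

This keeps the per-rank code simple, but it is still a parallel program, which
is exactly the extra baggage I was complaining about above.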

I am confident I could write a purely sequential C host program that accesses
all the GPU devices from one sequential host application.  For the application
I am describing, the equivalent PyCUDA application has to be more complex than
the equivalent C program in significant ways.  Extra complexity is a bad thing
for Python -- it goes against the philosophy of Python.  Python applications
are not supposed to be more complex than the equivalent C applications.


Thanks for the excellent, thought-provoking comments!

 
Regards,


ga


________________________________
 From: Freddie Witherden <[email protected]>
To: Geoffrey Anderson <[email protected]> 
Cc: [email protected] 
Sent: Saturday, April 6, 2013 7:13 PM
Subject: Re: [PyCUDA] spread independent work across multiple GPU devices
 

On 05/04/13 21:18, Geoffrey Anderson wrote:
> Hello,
> 
> I have a question about multiple GPU devices.  I finished my first 
> original pycuda application today.  Pycuda is excellent for the 
> simplicity improvement of the programming as provided by the 
> ElementwiseKernel.  The ElementwiseKernel is much, much better than 
> fiddling with the memory hierarchies within the GPU device.
> Elementwise is excellent because I prefer to focus more of my
> development effort on my application's logic and its parallel
> decomposition of work and internal synchronization.
> 
> [SNIP]

The CUDA API is not particularly well suited to using multiple devices
concurrently.  It is doable (just about!) but is not pleasant.  Without
a doubt the best way to use multiple GPU devices is indirectly by
parallelising your application with MPI (for example, by using the
excellent mpi4py library).

Doing so will not only allow you to take advantage of multiple GPUs
inside of a single system but will also allow you to split your work
across multiple *physical* systems connected via Ethernet/IB/etc.  More
recent MPI implementations have near complete support for GPU Direct
allowing CUDA device pointers to be passed directly to MPI_Send/Recv
functions.  (And if GPU Direct is not applicable falling back to a CUDA
memcpy and regular MPI_Send/Recv.)

While it is indeed possible to achieve a similar result using multiple
threads (with all of the caveats that entails) I would recommend against
any such approach.  Not only is it more limited than the MPI methodology
described above but it often results in inferior real-world performance.
(Welcome to the world of NUMA, where threaded applications come to die.)

Regards, Freddie.
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
