Received from Baskaran Sankaran on Wed, Nov 11, 2015 at 01:50:09PM EST:

> Thanks, Andreas, for the hint. Actually, what I am trying is a little more
> complex than that. I have two Python processes running on two GPUs. In a
> simpler setting, I have an array x in gpu0's Python process to be
> transferred to gpu1's process, and vice versa.
>
> I solved it with this scheme:
>
> * alloc host memory
> * memcpy from device to host (gpu0 to host; gpu1 to host)
> * send/receive the objects in host memory to the Python process on the
>   other gpu
> * memcpy from host to device within the respective gpu
>
> The solution and output from a sample run follow. Now, I wonder whether it
> is possible to improve this further. One possibility is eliminating the
> device-to-host copy, because I need to transfer several Theano tensors
> between multiple (up to 4) GPUs, and I need to do this quite frequently
> (say, every nth mini-batch) during training.
>
> Note: not all GPUs are P2P-capable, so memcpy_peer won't work.
If you have access to a recent release of OpenMPI or MVAPICH2 built with CUDA support, you may wish to try using mpi4py for transferring data between GPUArrays in different processes; you can pass the MPI wrapper functions the GPUArray pointers and let the underlying MPI implementation determine when to take advantage of P2P.

--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/

_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
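For reference, the mpi4py route suggested above might look like the following. This is a hypothetical sketch, not tested code from the thread: it assumes a CUDA-aware MPI build, one MPI rank per GPU, exactly two ranks, and that mpi4py's `MPI.memory.fromaddress` is available for wrapping raw device pointers; it will not run without MPI and two GPUs.

```python
# Hypothetical sketch: exchanging GPUArrays between two MPI ranks via a
# CUDA-aware MPI (OpenMPI or MVAPICH2).  Launch with, e.g.:
#   mpiexec -n 2 python exchange.py    # "exchange.py" is illustrative
import numpy as np
import pycuda.autoinit           # selects a GPU for this process
import pycuda.gpuarray as gpuarray
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank                  # with two ranks, exchange with the other

x_gpu = gpuarray.to_gpu(np.full(4, rank, dtype=np.float32))
recv_gpu = gpuarray.empty_like(x_gpu)

# Wrap the raw device pointers so mpi4py hands them to MPI directly; a
# CUDA-aware MPI recognizes device memory and uses P2P/IPC where the
# hardware allows it, otherwise it stages through the host internally.
sendbuf = [MPI.memory.fromaddress(int(x_gpu.gpudata), x_gpu.nbytes),
           MPI.FLOAT]
recvbuf = [MPI.memory.fromaddress(int(recv_gpu.gpudata), recv_gpu.nbytes),
           MPI.FLOAT]

comm.Sendrecv(sendbuf, dest=peer, recvbuf=recvbuf, source=peer)
print(rank, recv_gpu.get())      # each rank now holds the peer's array
```

The appeal of this design is that the P2P-versus-staging decision mentioned in the original note moves out of user code and into the MPI library, which already probes peer-access capability per device pair.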
