Received from Baskaran Sankaran on Wed, Nov 11, 2015 at 01:50:09PM EST:
> Thanks, Andreas, for the hint. Actually, what I am trying is a little more
> complex than that. I have two Python processes running on two GPUs. In a
> simpler setting, I have an array x in gpu0's Python process to be transferred
> to gpu1's process, and vice versa.
> 
> I solved it with this scheme:
> * alloc host memory
> * memcpy from device to host (gpu0 to host; gpu1 to host)
> * send/receive the objects in host memory to the Python process of the other
> gpu
> * memcpy from host to device within the respective gpu
> 
> The solution and output from a sample run follow. Now, I wonder whether it
> is possible to improve this further. One possibility is to eliminate the
> device-to-host copy, because I need to transfer several Theano tensors
> between multiple (up to 4) GPUs, and I need to do this quite frequently
> (say, every nth mini-batch) during training.
> 
> Note: not all GPUs are P2P-capable, so memcpy_peer wouldn't work.
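
For reference, your four-step scheme can be sketched CPU-side. This is a
minimal, hypothetical illustration: numpy arrays stand in for the GPUArrays,
a multiprocessing.Pipe stands in for whatever IPC channel the two GPU
processes actually use, and the real memcpy_dtoh/memcpy_htod calls are only
noted in comments.

```python
import numpy as np
from multiprocessing import Pipe, Process

def worker(conn, seed):
    # One "GPU process": a numpy array stands in for the device buffer.
    # (Assumes the default fork start method, i.e. Linux.)
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(4)        # pretend x lives on the GPU
    host_buf = x.copy()               # step 2: memcpy device -> host
    conn.send(host_buf)               # step 3: ship host buffer to peer
    peer_buf = conn.recv()            #         receive peer's host buffer
    # step 4 would be drv.memcpy_htod(dev_ptr, peer_buf) in real code
    conn.send(float(peer_buf.sum()))  # report a checksum for verification

a, b = Pipe()                         # stands in for the real IPC channel
procs = [Process(target=worker, args=(a, 0)),
         Process(target=worker, args=(b, 1))]
for p in procs:
    p.start()
for p in procs:
    p.join()
# The checksum each worker sent last is still queued on its pipe end.
s1 = a.recv()                         # worker 1's checksum of worker 0's data
s0 = b.recv()                         # worker 0's checksum of worker 1's data
print(s0, s1)
```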

If you have access to a recent release of OpenMPI or MVAPICH2 built with CUDA
support, you may wish to try mpi4py for transferring data between GPUArrays
in different processes; you can pass the GPUArray pointers to the MPI wrapper
functions and let the underlying MPI implementation determine when to take
advantage of P2P.
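
A rough sketch of what that could look like follows. This is an untested
assumption-laden example: it assumes a CUDA-aware MPI build, a recent mpi4py
that recognizes objects exposing __cuda_array_interface__ (recent PyCUDA
GPUArrays do), exactly two ranks, and one GPU per rank (e.g. selected via
CUDA_VISIBLE_DEVICES before launch).

```python
# Launch with something like: mpirun -np 2 python exchange.py
from mpi4py import MPI
import numpy as np
import pycuda.autoinit            # one process per GPU, selected per rank
import pycuda.gpuarray as gpuarray

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank                   # two-rank exchange for illustration

x = gpuarray.to_gpu(np.full(4, float(rank), dtype=np.float64))
y = gpuarray.empty_like(x)

# Sendrecv directly on the device buffers; a CUDA-aware MPI moves the
# data itself, using P2P where the hardware supports it and falling
# back to host staging where it doesn't -- no explicit dtoh/htod here.
comm.Sendrecv(sendbuf=[x, MPI.DOUBLE], dest=peer,
              recvbuf=[y, MPI.DOUBLE], source=peer)

print(rank, y.get())              # each rank now holds the peer's array
```

The appeal is exactly the point above: the staging decision moves out of
your training loop and into the MPI library.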
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/


_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
