Baskaran Sankaran <[email protected]> writes:
> Apologies for emailing you directly. I did subscribe to the PyCuda mailing
> list, but my request is not approved yet.
There is no approval process. It's likely that the subscription confirmation
went to your spam folder. I've CC'd the list; maybe someone on there knows.

> I have been using PyCuda in recent times to parallelize Theano across two
> GPUs, and I should say that it has been really useful. For example, I was
> able to achieve a 1.85x speedup of our Neural MT with PyCuda over the
> single-GPU version.

I'm happy to hear you're finding the software useful.

> I am now trying to see if I can parallelize it across more GPUs. However,
> the GPUs in this case are connected through socket-level links and not
> through PCIe switches. Here is the topology of a typical node in the GPU
> cluster:
>
> [xc181] ~ $ nvidia-smi topo -m
>         GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
> GPU0     X      PIX     SOC     SOC     PHB     0-7
> GPU1    PIX      X      SOC     SOC     PHB     0-7
> GPU2    SOC     SOC      X      PIX     SOC     8-15
> GPU3    SOC     SOC     PIX      X      SOC     8-15
> mlx4_0  PHB     PHB     SOC     SOC      X
>
> Legend:
>
>   X   = Self
>   SOC = Path traverses a socket-level link (e.g. QPI)
>   PHB = Path traverses a PCIe host bridge
>   PXB = Path traverses multiple PCIe internal switches
>   PIX = Path traverses a PCIe internal switch
>
> So, I wonder whether the PyCuda peer-to-peer copy (memcpy_peer) will work
> over these socket-level links. I am unable to test this in the cluster
> here, because GPUDirect is enabled only between pairs of GPUs (0-1 and
> 2-3). However, from the NVIDIA website, it seems that GPUDirect v3
> supports RDMA, which allows these kinds of transfers (across two nodes or
> across socket-linked nodes):
>
> https://developer.nvidia.com/gpudirect
> http://devblogs.nvidia.com/parallelforall/benchmarking-gpudirect-rdma-on-modern-server-platforms/
>
> I must admit that I am not very familiar with the differences between
> these technologies, so my understanding could be incorrect. So, my
> question is whether PyCuda memcpy_peer supports RDMA-style GPUDirect
> transfers? Any info will be greatly appreciated.
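As an aside, the link matrix above is plain text and easy to inspect programmatically. Here is a minimal sketch (the `parse_topo` helper is my own, not part of PyCUDA or nvidia-smi) that parses `nvidia-smi topo -m` output into per-pair link types, so a script can decide which GPU pairs share a PCIe switch (PIX) and which cross a socket-level link (SOC):

```python
# Sample output of `nvidia-smi topo -m`, taken from the message above.
SAMPLE = """\
        GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
GPU0     X      PIX     SOC     SOC     PHB     0-7
GPU1    PIX      X      SOC     SOC     PHB     0-7
GPU2    SOC     SOC      X      PIX     SOC     8-15
GPU3    SOC     SOC     PIX      X      SOC     8-15
mlx4_0  PHB     PHB     SOC     SOC     X
"""

def parse_topo(text):
    """Parse `nvidia-smi topo -m` output into {(row, col): link_type}.

    Self entries ("X") are skipped; the trailing "CPU Affinity" column
    is ignored because its header splits into two tokens.
    """
    lines = [l for l in text.strip().splitlines() if l.strip()]
    # Keep only device columns (GPU* and the InfiniBand adapter mlx*).
    cols = [h for h in lines[0].split() if h.startswith(("GPU", "mlx"))]
    links = {}
    for line in lines[1:]:
        tokens = line.split()
        row = tokens[0]
        # zip() silently drops the extra CPU-affinity token at the end.
        for col, val in zip(cols, tokens[1:]):
            if val != "X":
                links[(row, col)] = val
    return links

links = parse_topo(SAMPLE)
print(links[("GPU0", "GPU1")])  # PIX: same PCIe switch
print(links[("GPU0", "GPU2")])  # SOC: crosses the QPI link
```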
Sorry, I haven't used this technology myself, so I simply don't know. What I
can say is that if any amount of control over this is available through the
CUDA API, that same level of control should also be achievable through
PyCUDA. Maybe someone on the list has an idea.

Hope that helps at least a bit,
Andreas
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
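For reference, the level of control mentioned above looks roughly like this through PyCUDA's driver bindings. This is an untested sketch, not a verified answer to the SOC-link question: it assumes a machine with two or more GPUs, and whether `can_access_peer` reports True across a socket-level link depends on the driver and platform, not on PyCUDA.

```python
# Hedged sketch: probe peer access between all GPU pairs, then do one
# peer-to-peer copy where the driver reports it is supported.
# Requires PyCUDA and at least two CUDA GPUs; untested on SOC topologies.
import numpy as np
import pycuda.driver as cuda

cuda.init()
ndev = cuda.Device.count()

# One context per device; make_context() pushes it, so pop each again.
devices = [cuda.Device(i) for i in range(ndev)]
contexts = [d.make_context() for d in devices]
for ctx in contexts:
    ctx.pop()

# can_access_peer wraps cuDeviceCanAccessPeer: it reports whether the
# driver supports direct P2P between a pair of devices. On SOC-linked
# pairs this is where a False would show up.
for i in range(ndev):
    for j in range(ndev):
        if i != j:
            ok = devices[i].can_access_peer(devices[j])
            print("GPU%d -> GPU%d peer access: %s" % (i, j, ok))

if ndev >= 2 and devices[0].can_access_peer(devices[1]):
    host = np.arange(1024, dtype=np.float32)

    contexts[0].push()
    src = cuda.mem_alloc(host.nbytes)
    cuda.memcpy_htod(src, host)
    contexts[0].pop()

    contexts[1].push()
    dst = cuda.mem_alloc(host.nbytes)
    contexts[1].pop()

    # memcpy_peer takes the destination and source contexts explicitly,
    # so neither needs to be current.
    cuda.memcpy_peer(dst, src, host.nbytes,
                     dest_context=contexts[1], src_context=contexts[0])
```

Note that `memcpy_peer` (cuMemcpyPeer) works even without enabling peer access; enabling it via `Context.enable_peer_access` matters when a kernel dereferences a pointer into another device's memory.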
