Baskaran Sankaran <[email protected]> writes:
> Apologies for emailing you directly. I did subscribe to the PyCUDA mailing
> list, but my request has not been approved yet.

There is no approval process. It's likely that the list's confirmation
request went to your spam folder. I've CC'd the list; maybe someone
there knows.

> I have recently been using PyCUDA to parallelize Theano across two
> GPUs, and I should say that it has been really useful. For example, I
> was able to achieve a 1.85x speedup of our neural MT system with
> PyCUDA over the single-GPU version.

I'm happy to hear you're finding the software useful.

> I am now trying to see whether I can parallelize it across more GPUs.
> However, the GPUs in this case are connected through socket-level
> links and not through PCIe switches. Here is the topology of a typical
> node in the GPU cluster:
>
> [xc181] ~ $  nvidia-smi topo -m
>         GPU0    GPU1    GPU2    GPU3    mlx4_0  CPU Affinity
> GPU0     X      PIX     SOC     SOC     PHB     0-7
> GPU1    PIX      X      SOC     SOC     PHB     0-7
> GPU2    SOC     SOC      X      PIX     SOC     8-15
> GPU3    SOC     SOC     PIX      X      SOC     8-15
> mlx4_0  PHB     PHB     SOC     SOC      X
>
> Legend:
>
>   X   = Self
>   SOC = Path traverses a socket-level link (e.g. QPI)
>   PHB = Path traverses a PCIe host bridge
>   PXB = Path traverses multiple PCIe internal switches
>   PIX = Path traverses a PCIe internal switch
>
> So I wonder whether the PyCUDA peer-to-peer copy (memcpy_peer) will
> work across these socket-level links. I am unable to test this on the
> cluster here, because GPUDirect is enabled only between pairs of GPUs
> (0-1 and 2-3). However, from the NVIDIA website, it seems that
> GPUDirect v3 supports RDMA, which allows these kinds of transfers
> (across two nodes, or between socket-linked GPUs within a node).
>
> https://developer.nvidia.com/gpudirect
> http://devblogs.nvidia.com/parallelforall/benchmarking-gpudirect-rdma-on-modern-server-platforms/
>
> I must admit that I am not very familiar with the differences between
> these technologies, so my understanding could be incorrect. So my
> question here is whether PyCUDA's memcpy_peer will support RDMA-style
> GPUDirect transfers. Any info will be greatly appreciated.

Sorry, I haven't used this technology myself, so I simply don't
know. What I can say is that if any amount of control over this is
available through the CUDA API, that same level of control should also
be achievable through PyCUDA.
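
That said, here is a rough, untested sketch of what the experiment
might look like in PyCUDA once you can get at a suitable machine. The
device numbers (0 and 2, i.e. one GPU from each socket in your
topology) and the buffer size are placeholder assumptions; the script
queries Device.can_access_peer and then tries a memcpy_peer between
two contexts:

    import numpy as np
    import pycuda.driver as cuda

    cuda.init()

    # Assumption: GPU 0 and GPU 2 sit on different sockets in the
    # topology above; adjust the device numbers to match your machine.
    dev0, dev2 = cuda.Device(0), cuda.Device(2)

    # Device-level query: can these two GPUs access each other's memory
    # directly? Across a SOC (QPI) link this will typically be False.
    print("0 <-> 2 peer-capable:", dev0.can_access_peer(dev2))

    ctx0 = dev0.make_context()     # context on GPU 0 (current)
    ctx0.pop()
    ctx2 = dev2.make_context()     # context on GPU 2 (now current)

    a = np.random.randn(1 << 20).astype(np.float32)

    ctx0.push()                    # fill a source buffer on GPU 0
    src = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(src, a)
    ctx0.pop()

    dst = cuda.mem_alloc(a.nbytes) # destination buffer on GPU 2

    # The copy itself. memcpy_peer does not require peer access to be
    # enabled: the CUDA docs say the driver stages the transfer through
    # host memory when no direct peer path exists, so this is expected
    # to work (if more slowly) even across the socket link.
    cuda.memcpy_peer(dst, src, a.nbytes,
                     dest_context=ctx2, src_context=ctx0)

    b = np.empty_like(a)
    cuda.memcpy_dtoh(b, dst)       # read back on GPU 2 and verify
    print("copy ok:", np.allclose(a, b))

    ctx2.pop()                     # leave the context stack empty

For what it's worth, the CUDA documentation states that cuMemcpyPeer
(which memcpy_peer wraps) falls back to staging the copy through host
memory when no direct peer path exists, so the copy may well succeed
across the QPI link even if can_access_peer returns False; whether the
resulting bandwidth is acceptable is something you would have to
measure.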

Maybe someone on the list has an idea.

Hope that helps at least a bit,
Andreas
