With

* MLNX OFED stack tailored for GPUDirect
* RHEL + kernel patch
* MVAPICH2

it is possible to monitor GPUDirect v1 activity by observing changes to the values in

* /sys/module/ib_core/parameters/gpu_direct_pages
* /sys/module/ib_core/parameters/gpu_direct_shares

After setting CUDA_NIC_INTEROP=1, these values no longer change. Is there a different way now to monitor whether GPUDirect actually works?

Sebastian.
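For reference, a minimal sketch of how those counters could be checked around a transfer. The two sysfs paths are the ones named above; that they hold single integer values readable like ordinary files is an assumption, and whether a change across a transfer is still a reliable indicator once CUDA_NIC_INTEROP=1 is set is exactly the open question here.

    /* Sketch: read the ib_core GPUDirect counters before and after a transfer
     * and report whether they changed.  The paths come from the message above;
     * the plain-integer file format is an assumption. */
    #include <stdio.h>

    static long read_counter(const char *path)
    {
        FILE *f = fopen(path, "r");
        long val = -1;
        if (f) {
            if (fscanf(f, "%ld", &val) != 1)
                val = -1;
            fclose(f);
        }
        return val;
    }

    int main(void)
    {
        const char *pages  = "/sys/module/ib_core/parameters/gpu_direct_pages";
        const char *shares = "/sys/module/ib_core/parameters/gpu_direct_shares";

        long p0 = read_counter(pages),  s0 = read_counter(shares);
        /* ... run the GPUDirect transfer to be checked here ... */
        long p1 = read_counter(pages),  s1 = read_counter(shares);

        printf("gpu_direct_pages:  %ld -> %ld (%s)\n", p0, p1,
               p1 != p0 ? "changed" : "unchanged");
        printf("gpu_direct_shares: %ld -> %ld (%s)\n", s0, s1,
               s1 != s0 ? "changed" : "unchanged");
        return 0;
    }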
On Jan 18, 2012, at 5:06 PM, Kenneth Lloyd wrote:

> It is documented in
> http://developer.download.nvidia.com/compute/cuda/4_0/docs/GPUDirect_Technology_Overview.pdf
>
> set CUDA_NIC_INTEROP=1
>
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Sebastian Rinke
> Sent: Wednesday, January 18, 2012 8:15 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] GPUDirect v1 issues
>
> Setting the environment variable fixed the problem for Open MPI with CUDA 4.0. Thanks!
>
> However, I'm wondering why this is not documented in the NVIDIA GPUDirect package.
>
> Sebastian.
>
> On Jan 18, 2012, at 1:28 AM, Rolf vandeVaart wrote:
>
> Yes, the step outlined in your second bullet is no longer necessary.
>
> Rolf
>
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Sebastian Rinke
> Sent: Tuesday, January 17, 2012 5:22 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] GPUDirect v1 issues
>
> Thank you very much. I will try setting the environment variable and, if required, also use the 4.1 RC2 version.
>
> To clarify things a little bit for me, to set up my machine with GPUDirect v1 I did the following:
>
> * Install RHEL 5.4
> * Use the kernel with GPUDirect support
> * Use the MLNX OFED stack with GPUDirect support
> * Install the CUDA developer driver
>
> Does using CUDA >= 4.0 make one of the above steps redundant?
> I.e., is RHEL, the patched kernel, or the MLNX OFED stack with GPUDirect support no longer needed?
>
> Sebastian.
>
> Rolf vandeVaart wrote:
>
> I ran your test case against Open MPI 1.4.2 and CUDA 4.1 RC2 and it worked fine. I do not have a machine right now where I can load CUDA 4.0 drivers. Any chance you can try CUDA 4.1 RC2? There were some improvements in the support (for one, you do not need to set an environment variable).
> http://developer.nvidia.com/cuda-toolkit-41
>
> There is also a chance that setting the environment variable as outlined in this link may help you.
> http://forums.nvidia.com/index.php?showtopic=200629
>
> However, I cannot explain why MVAPICH would work and Open MPI would not.
>
> Rolf
>
> -----Original Message-----
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Sebastian Rinke
> Sent: Tuesday, January 17, 2012 12:08 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] GPUDirect v1 issues
>
> I use CUDA 4.0 with MVAPICH2 1.5.1p1 and Open MPI 1.4.2.
>
> Attached you find a little test case which is based on the GPUDirect v1 test case (mpi_pinned.c).
> In that program the sender splits a message into chunks and sends them separately to the receiver, which posts the corresponding recvs. It is a kind of pipelining.
>
> In mpi_pinned.c:141 the offsets into the recv buffer are set.
> With the correct offsets, i.e. increasing them, it blocks with Open MPI.
> Using line 142 instead (offset = 0) works.
>
> The tarball attached contains a Makefile where you will have to adjust
>
> * CUDA_INC_DIR
> * CUDA_LIB_DIR
>
> Sebastian
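The attachment itself is not preserved in the archive, but the pattern described above can be sketched roughly as follows. This is not the actual mpi_pinned.c; message size, chunk count, tags, and ranks are invented for illustration. The receiver posts one recv per chunk at an increasing offset into a single buffer allocated with cudaMallocHost(), which is the case reported to hang; receiving every chunk at offset 0 corresponds to the workaround mentioned above.

    /* Rough sketch of the chunked, pipelined transfer described above.
     * NOT the mpi_pinned.c attachment; sizes, tags, and ranks are made up.
     * Rank 0 sends a message in chunks; rank 1 receives each chunk at an
     * increasing offset into one cudaMallocHost() buffer. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define MSG_SIZE   (4 * 1024 * 1024)   /* total message size in bytes */
    #define NUM_CHUNKS 8
    #define CHUNK_SIZE (MSG_SIZE / NUM_CHUNKS)

    int main(int argc, char **argv)
    {
        int rank;
        char *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Pinned host memory, as in the GPUDirect v1 test case. */
        cudaMallocHost((void **)&buf, MSG_SIZE);

        for (int i = 0; i < NUM_CHUNKS; i++) {
            size_t offset = (size_t)i * CHUNK_SIZE;   /* increasing offsets: the case that hangs */
            /* size_t offset = 0; */                  /* offset 0 every time: the case that works */

            if (rank == 0)
                MPI_Send(buf + offset, CHUNK_SIZE, MPI_CHAR, 1, i, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(buf + offset, CHUNK_SIZE, MPI_CHAR, 0, i, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        if (rank == 1)
            printf("all %d chunks received\n", NUM_CHUNKS);

        cudaFreeHost(buf);
        MPI_Finalize();
        return 0;
    }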
> On Jan 17, 2012, at 4:16 PM, Kenneth A. Lloyd wrote:
>
> Also, which version of MVAPICH2 did you use?
>
> I've been poring over Rolf's OpenMPI CUDA RDMA 3 (using CUDA 4.1 r2) vis-à-vis MVAPICH-GPU on a small 3 node cluster. These are wickedly interesting.
>
> Ken
>
> -----Original Message-----
> From: devel-boun...@open-mpi.org [mailto:devel-bounces@open-mpi.org] On Behalf Of Rolf vandeVaart
> Sent: Tuesday, January 17, 2012 7:54 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] GPUDirect v1 issues
>
> I am not aware of any issues. Can you send me a test program and I can try it out?
> Which version of CUDA are you using?
>
> Rolf
>
> -----Original Message-----
> From: devel-boun...@open-mpi.org [mailto:devel-bounces@open-mpi.org] On Behalf Of Sebastian Rinke
> Sent: Tuesday, January 17, 2012 8:50 AM
> To: Open MPI Developers
> Subject: [OMPI devel] GPUDirect v1 issues
>
> Dear all,
>
> I'm using GPUDirect v1 with Open MPI 1.4.3 and see blocking MPI_SEND/RECV calls block forever.
>
> For two subsequent MPI_RECVs, it hangs if the recv buffer pointer of the second recv points somewhere other than the beginning of the recv buffer (previously allocated with cudaMallocHost()).
>
> I tried the same with MVAPICH2 and did not see the problem.
>
> Does anybody know about issues with GPUDirect v1 using Open MPI?
>
> Thanks for your help,
> Sebastian