Ok, thank you Ken and Rolf. I will have a look into the 4.1 version.

@Rolf:
I actually meant MVAPICH2, since Open MPI requires CUDA_NIC_INTEROP=1 to be 
set.
However, setting the environment variable does not produce any changes in the 
files mentioned previously.
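
For completeness, a minimal sketch of how the variable can be set (an
illustration only; it assumes the driver reads the variable at initialization
time, so it has to be in the environment before the first CUDA call, e.g.
exported via mpirun -x CUDA_NIC_INTEROP=1 or set from within the program):

/* Sketch: make sure CUDA_NIC_INTEROP=1 is set before the first CUDA call.
 * Assumption: the driver picks the variable up when the context is created. */
#include <stdlib.h>
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    setenv("CUDA_NIC_INTEROP", "1", 1);   /* must precede cudaSetDevice() */

    if (cudaSetDevice(0) != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed\n");
        return 1;
    }
    printf("CUDA_NIC_INTEROP=%s\n", getenv("CUDA_NIC_INTEROP"));
    return 0;
}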

Nevertheless, you already answered my question. Thanks.

Sebastian. 

On Jan 21, 2012, at 4:03 PM, Kenneth Lloyd wrote:

> Sebastian,
> 
> If possible, I strongly suggest you look into CUDA 4.1 RC2 and Rolf 
> vandeVaart's MPI CUDA RDMA 3. Your life will be MUCH easier.
> 
> Having used GPUDirect v1 in the second half of 2010, I can say it is a pain 
> for the 9-14% gain in efficiency we saw.
> 
> Ken
> 
> On Fri, 2012-01-20 at 18:20 +0100, Sebastian Rinke wrote:
>> 
>> With 
>> 
>> 
>> * MLNX OFED stack tailored for GPUDirect
>> * RHEL + kernel patch 
>> * MVAPICH2 
>> 
>> 
>> it is possible to monitor GPUDirect v1 activity by observing changes to the 
>> values in
>> 
>> 
>> * /sys/module/ib_core/parameters/gpu_direct_pages
>> * /sys/module/ib_core/parameters/gpu_direct_shares
>> 
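>> Roughly, the check amounts to reading those two files before and after a
>> transfer and seeing whether the numbers move. A minimal sketch (assuming the
>> parameters are plain numeric counters; read_counter() is just an
>> illustrative helper):
>> 
>> /* Sketch: print the two ib_core GPUDirect v1 counters listed above.
>>  * They only exist with the patched kernel and the GPUDirect-enabled
>>  * MLNX OFED stack installed. */
>> #include <stdio.h>
>> 
>> static long read_counter(const char *path)
>> {
>>     long v = -1;               /* -1 if the parameter is not present */
>>     FILE *f = fopen(path, "r");
>>     if (f) {
>>         if (fscanf(f, "%ld", &v) != 1)
>>             v = -1;
>>         fclose(f);
>>     }
>>     return v;
>> }
>> 
>> int main(void)
>> {
>>     printf("gpu_direct_pages  = %ld\n",
>>            read_counter("/sys/module/ib_core/parameters/gpu_direct_pages"));
>>     printf("gpu_direct_shares = %ld\n",
>>            read_counter("/sys/module/ib_core/parameters/gpu_direct_shares"));
>>     return 0;
>> }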
>> 
>> With CUDA_NIC_INTEROP=1 set, these values no longer change.
>> 
>> 
>> Is there now a different way to check whether GPUDirect is actually working?
>> 
>> 
>> Sebastian.
>> 
>> On Jan 18, 2012, at 5:06 PM, Kenneth Lloyd wrote:
>> 
>>> It is documented in 
>>> http://developer.download.nvidia.com/compute/cuda/4_0/docs/GPUDirect_Technology_Overview.pdf
>>> (set CUDA_NIC_INTEROP=1).
>>>  
>>>  
>>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On 
>>> Behalf Of Sebastian Rinke
>>> Sent: Wednesday, January 18, 2012 8:15 AM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] GPUDirect v1 issues
>>>  
>>> Setting the environment variable fixed the problem for Open MPI with CUDA 
>>> 4.0. Thanks!
>>>  
>>> However, I'm wondering why this is not documented in the NVIDIA GPUDirect 
>>> package.
>>>  
>>> Sebastian.
>>>  
>>> On Jan 18, 2012, at 1:28 AM, Rolf vandeVaart wrote:
>>> 
>>> 
>>> 
>>> Yes, the step outlined in your second bullet is no longer necessary. 
>>>  
>>> Rolf
>>>  
>>>  
>>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On 
>>> Behalf Of Sebastian Rinke
>>> Sent: Tuesday, January 17, 2012 5:22 PM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] GPUDirect v1 issues
>>>  
>>> Thank you very much. I will try setting the environment variable and if 
>>> required also use the 4.1 RC2 version.
>>> 
>>> To clarify things a little for me: to set up my machine with GPUDirect v1, 
>>> I did the following:
>>> 
>>> * Install RHEL 5.4
>>> * Use the kernel with GPUDirect support
>>> * Use the MLNX OFED stack with GPUDirect support
>>> * Install the CUDA developer driver
>>> 
>>> Does using CUDA >= 4.0 make any of the above steps redundant?
>>> 
>>> That is, is RHEL, the patched kernel, or the MLNX OFED stack with GPUDirect 
>>> support no longer needed?
>>> 
>>> Sebastian.
>>> 
>>> Rolf vandeVaart wrote:
>>> I ran your test case against Open MPI 1.4.2 and CUDA 4.1 RC2 and it worked 
>>> fine.  I do not have a machine right now where I can load CUDA 4.0 drivers.
>>> Any chance you can try CUDA 4.1 RC2?  There were some improvements in the 
>>> support (you do not need to set an environment variable, for one).
>>> http://developer.nvidia.com/cuda-toolkit-41
>>>  
>>> There is also a chance that setting the environment variable as outlined in 
>>> this link may help you.
>>> http://forums.nvidia.com/index.php?showtopic=200629
>>>  
>>> However, I cannot explain why MVAPICH would work and Open MPI would not.  
>>>  
>>> Rolf
>>>  
>>>   
>>> -----Original Message-----
>>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>>> On Behalf Of Sebastian Rinke
>>> Sent: Tuesday, January 17, 2012 12:08 PM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] GPUDirect v1 issues
>>>  
>>> I use CUDA 4.0 with MVAPICH2 1.5.1p1 and Open MPI 1.4.2.
>>>  
>>> Attached you will find a small test case based on the GPUDirect v1 test case
>>> (mpi_pinned.c).
>>> In that program the sender splits a message into chunks and sends them
>>> separately to the receiver, which posts the corresponding recvs. It is a kind
>>> of pipelining.
>>>  
>>> In mpi_pinned.c:141 the offsets into the recv buffer are set.
>>> With the correct, i.e. increasing, offsets it blocks with Open MPI.
>>>  
>>> Using line 142 instead (offset = 0) works.
>>>  
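>>> Roughly, the pattern is the following (a sketch only, not the attached
>>> mpi_pinned.c; the chunk count and chunk size are made up):
>>> 
>>> /* Sketch of the pipelined pattern described above.  The receive buffer
>>>  * comes from cudaMallocHost(); rank 1 posts one recv per chunk at an
>>>  * increasing offset, which is the case that hangs here with Open MPI.
>>>  * Using offset 0 for every chunk corresponds to the working variant. */
>>> #include <mpi.h>
>>> #include <cuda_runtime.h>
>>> 
>>> #define CHUNKS     4
>>> #define CHUNK_SIZE (1 << 20)   /* 1 MiB per chunk, illustrative */
>>> 
>>> int main(int argc, char **argv)
>>> {
>>>     int rank;
>>>     char *buf;
>>> 
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     cudaMallocHost((void **)&buf, (size_t)CHUNKS * CHUNK_SIZE);
>>> 
>>>     for (int i = 0; i < CHUNKS; i++) {
>>>         size_t offset = (size_t)i * CHUNK_SIZE;
>>>         if (rank == 0)
>>>             MPI_Send(buf + offset, CHUNK_SIZE, MPI_CHAR, 1, i, MPI_COMM_WORLD);
>>>         else if (rank == 1)
>>>             MPI_Recv(buf + offset, CHUNK_SIZE, MPI_CHAR, 0, i, MPI_COMM_WORLD,
>>>                      MPI_STATUS_IGNORE);
>>>     }
>>> 
>>>     cudaFreeHost(buf);
>>>     MPI_Finalize();
>>>     return 0;
>>> }
>>>  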
>>> The attached tarball contains a Makefile in which you will have to adjust
>>>  
>>> * CUDA_INC_DIR
>>> * CUDA_LIB_DIR
>>>  
>>> Sebastian
>>>  
>>> On Jan 17, 2012, at 4:16 PM, Kenneth A. Lloyd wrote:
>>>  
>>>    
>>> Also, which version of MVAPICH2 did you use?
>>>  
>>> I've been poring over Rolf's OpenMPI CUDA RDMA 3 (using CUDA 4.1 RC2)
>>> vis-à-vis MVAPICH-GPU on a small 3-node cluster. These are wickedly interesting.
>>>  
>>> Ken
>>> -----Original Message-----
>>> From: devel-boun...@open-mpi.org [mailto:devel-bounces@open-mpi.org]
>>> On Behalf Of Rolf vandeVaart
>>> Sent: Tuesday, January 17, 2012 7:54 AM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] GPUDirect v1 issues
>>>  
>>> I am not aware of any issues.  Can you send me a test program so that I
>>> can try it out?
>>> Which version of CUDA are you using?
>>>  
>>> Rolf
>>>  
>>>       
>>> -----Original Message-----
>>> From: devel-boun...@open-mpi.org [mailto:devel-bounces@open-mpi.org]
>>> On Behalf Of Sebastian Rinke
>>> Sent: Tuesday, January 17, 2012 8:50 AM
>>> To: Open MPI Developers
>>> Subject: [OMPI devel] GPUDirect v1 issues
>>>  
>>> Dear all,
>>>  
>>> I'm using GPUDirect v1 with Open MPI 1.4.3 and am seeing blocking
>>> MPI_SEND/MPI_RECV calls hang forever.
>>>  
>>> With two subsequent MPI_RECVs, it hangs if the recv buffer pointer of
>>> the second recv points somewhere other than the beginning of the recv
>>> buffer (previously allocated with cudaMallocHost()).
>>>  
>>> I tried the same with MVAPICH2 and did not see the problem.
>>>  
>>> Does anybody know about issues with GPUDirect v1 using Open MPI?
>>>  
>>> Thanks for your help,
>>> Sebastian
>> 
> 
> ==============
> Kenneth A. Lloyd, Jr.
> CEO - Director of Systems Science
> Watt Systems Technologies Inc.
> Albuquerque, NM US
> 
