Ok, thank you Ken and Rolf. I will have a look at the 4.1 version.

@ Rolf: I actually meant MVAPICH2, since Open MPI requires CUDA_NIC_INTEROP=1 to be set. However, setting the environment variable does not show any changes in the files previously mentioned.
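For reference, a minimal sketch of dumping those counters from C; the two paths are the ones quoted further below, and everything else (names, error handling, treating them as plain-text sysfs files) is purely illustrative:

/* Sketch: print the ib_core GPUDirect counters, e.g. before and after a transfer. */
#include <stdio.h>

static void print_counter(const char *path)
{
    char line[64];
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        printf("%s: (not readable)\n", path);
        return;
    }
    if (fgets(line, sizeof(line), f) != NULL)
        printf("%s: %s", path, line);
    fclose(f);
}

int main(void)
{
    print_counter("/sys/module/ib_core/parameters/gpu_direct_pages");
    print_counter("/sys/module/ib_core/parameters/gpu_direct_shares");
    return 0;
}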
Nevertheless, you already answered my question. Thanks.

Sebastian.

On Jan 21, 2012, at 4:03 PM, Kenneth Lloyd wrote:

> Sebastian,
>
> If possible, I strongly suggest you look into CUDA 4.1 RC2 and use Rolf
> vandeVaart's MPI CUDA RDMA 3. Your life will be MUCH easier.
>
> Having used GPUDirect v1 in the last half of 2010, I can say it is a pain for
> the 9 - 14% gain in efficiency we saw.
>
> Ken
>
> On Fri, 2012-01-20 at 18:20 +0100, Sebastian Rinke wrote:
>>
>> With
>>
>> * MLNX OFED stack tailored for GPUDirect
>> * RHEL + kernel patch
>> * MVAPICH2
>>
>> it is possible to monitor GPUDirect v1 activity by observing changes to the values in
>>
>> * /sys/module/ib_core/parameters/gpu_direct_pages
>> * /sys/module/ib_core/parameters/gpu_direct_shares
>>
>> With CUDA_NIC_INTEROP=1 set, these values no longer change.
>>
>> Is there a different way now to monitor whether GPUDirect actually works?
>>
>> Sebastian.
>>
>> On Jan 18, 2012, at 5:06 PM, Kenneth Lloyd wrote:
>>
>>> It is documented in
>>> http://developer.download.nvidia.com/compute/cuda/4_0/docs/GPUDirect_Technology_Overview.pdf
>>>
>>> set CUDA_NIC_INTEROP=1
>>>
>>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Sebastian Rinke
>>> Sent: Wednesday, January 18, 2012 8:15 AM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] GPUDirect v1 issues
>>>
>>> Setting the environment variable fixed the problem for Open MPI with CUDA 4.0. Thanks!
>>>
>>> However, I'm wondering why this is not documented in the NVIDIA GPUDirect package.
>>>
>>> Sebastian.
>>>
>>> On Jan 18, 2012, at 1:28 AM, Rolf vandeVaart wrote:
>>>
>>> Yes, the step outlined in your second bullet is no longer necessary.
>>>
>>> Rolf
>>>
>>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Sebastian Rinke
>>> Sent: Tuesday, January 17, 2012 5:22 PM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] GPUDirect v1 issues
>>>
>>> Thank you very much. I will try setting the environment variable and, if required, also use the 4.1 RC2 version.
>>>
>>> To clarify things a little bit for me, to set up my machine with GPUDirect v1 I did the following:
>>>
>>> * Install RHEL 5.4
>>> * Use the kernel with GPUDirect support
>>> * Use the MLNX OFED stack with GPUDirect support
>>> * Install the CUDA developer driver
>>>
>>> Does using CUDA >= 4.0 make one of the above steps redundant?
>>> I.e., is RHEL, the patched kernel, or the MLNX OFED stack with GPUDirect support no longer needed?
>>>
>>> Sebastian.
>>>
>>> Rolf vandeVaart wrote:
>>>
>>> I ran your test case against Open MPI 1.4.2 and CUDA 4.1 RC2 and it worked fine. I do not have a machine right now where I can load CUDA 4.0 drivers.
>>> Any chance you can try CUDA 4.1 RC2? There were some improvements in the support (for one, you do not need to set an environment variable).
>>> http://developer.nvidia.com/cuda-toolkit-41
>>>
>>> There is also a chance that setting the environment variable as outlined in this link may help you:
>>> http://forums.nvidia.com/index.php?showtopic=200629
>>>
>>> However, I cannot explain why MVAPICH would work and Open MPI would not.
>>>
>>> Rolf
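A minimal sketch of one way to make sure the variable is set for every rank, assuming the CUDA driver reads CUDA_NIC_INTEROP when it is first initialized in the process; the documented route is simply to export it in the launch environment (e.g. "mpirun -x CUDA_NIC_INTEROP=1 ..." with Open MPI), so everything below is illustrative:

#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    /* Assumption: the variable must be in the environment before the
     * first CUDA call in this process. */
    setenv("CUDA_NIC_INTEROP", "1", 1);

    MPI_Init(&argc, &argv);
    cudaSetDevice(0);   /* first CUDA call in this rank */

    /* ... cudaMallocHost() buffers and MPI_Send/MPI_Recv as usual ... */

    MPI_Finalize();
    return 0;
}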
>>> -----Original Message-----
>>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Sebastian Rinke
>>> Sent: Tuesday, January 17, 2012 12:08 PM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] GPUDirect v1 issues
>>>
>>> I use CUDA 4.0 with MVAPICH2 1.5.1p1 and Open MPI 1.4.2.
>>>
>>> Attached you find a little test case which is based on the GPUDirect v1 test case (mpi_pinned.c).
>>> In that program the sender splits a message into chunks and sends them separately to the receiver, which posts the corresponding recvs. It is a kind of pipelining.
>>>
>>> In mpi_pinned.c:141 the offsets into the recv buffer are set.
>>> With the correct offsets, i.e. increasing ones, it blocks with Open MPI.
>>> Using line 142 instead (offset = 0) works.
>>>
>>> The attached tarball contains a Makefile in which you will have to adjust
>>>
>>> * CUDA_INC_DIR
>>> * CUDA_LIB_DIR
>>>
>>> Sebastian
>>>
>>> On Jan 17, 2012, at 4:16 PM, Kenneth A. Lloyd wrote:
>>>
>>> Also, which version of MVAPICH2 did you use?
>>>
>>> I've been poring over Rolf's OpenMPI CUDA RDMA 3 (using CUDA 4.1 RC2) vis-a-vis MVAPICH-GPU on a small 3-node cluster. These are wickedly interesting.
>>>
>>> Ken
>>>
>>> -----Original Message-----
>>> From: devel-boun...@open-mpi.org [mailto:devel-bounces@open-mpi.org] On Behalf Of Rolf vandeVaart
>>> Sent: Tuesday, January 17, 2012 7:54 AM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] GPUDirect v1 issues
>>>
>>> I am not aware of any issues. Can you send me a test program and I can try it out?
>>> Which version of CUDA are you using?
>>>
>>> Rolf
>>>
>>> -----Original Message-----
>>> From: devel-boun...@open-mpi.org [mailto:devel-bounces@open-mpi.org] On Behalf Of Sebastian Rinke
>>> Sent: Tuesday, January 17, 2012 8:50 AM
>>> To: Open MPI Developers
>>> Subject: [OMPI devel] GPUDirect v1 issues
>>>
>>> Dear all,
>>>
>>> I'm using GPUDirect v1 with Open MPI 1.4.3 and see blocking MPI_SEND/RECV calls block forever.
>>>
>>> For two subsequent MPI_RECVs, it hangs if the recv buffer pointer of the second recv points somewhere other than the beginning of the recv buffer (previously allocated with cudaMallocHost()).
>>>
>>> I tried the same with MVAPICH2 and did not see the problem.
>>>
>>> Does anybody know about issues with GPUDirect v1 using Open MPI?
>>>
>>> Thanks for your help,
>>> Sebastian
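For reference, a minimal sketch of the receive pattern described in the quoted messages above (chunked sends into increasing offsets of a cudaMallocHost() buffer). This is not the attached mpi_pinned.c; chunk count, sizes, and tags are illustrative:

/* Sketch: rank 0 sends a message in chunks, rank 1 posts one recv per
 * chunk at increasing offsets into a pinned host buffer. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define NCHUNKS    4
#define CHUNK_SIZE (1 << 20)          /* 1 MiB per chunk, illustrative */

int main(int argc, char **argv)
{
    int rank, i;
    char *buf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Pinned host buffer, as in the GPUDirect v1 test case. */
    cudaMallocHost((void **)&buf, (size_t)NCHUNKS * CHUNK_SIZE);

    if (rank == 0) {
        for (i = 0; i < NCHUNKS; i++) {
            size_t offset = (size_t)i * CHUNK_SIZE;   /* increasing offsets */
            MPI_Send(buf + offset, CHUNK_SIZE, MPI_CHAR, 1, i, MPI_COMM_WORLD);
        }
    } else if (rank == 1) {
        for (i = 0; i < NCHUNKS; i++) {
            size_t offset = (size_t)i * CHUNK_SIZE;   /* offset = 0 here reportedly avoids the hang */
            MPI_Recv(buf + offset, CHUNK_SIZE, MPI_CHAR, 0, i, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        printf("received %d chunks\n", NCHUNKS);
    }

    cudaFreeHost(buf);
    MPI_Finalize();
    return 0;
}

Run with two ranks (e.g. mpirun -np 2); per the report above, the increasing offsets hang with Open MPI 1.4.x and CUDA 4.0, while using offset 0 for every chunk does not.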
>
> ==============
> Kenneth A. Lloyd, Jr.
> CEO - Director of Systems Science
> Watt Systems Technologies Inc.
> Albuquerque, NM US