I looked at it before and checked again, and still see
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi

> Using both MPI and NCCL to perform transfers between the same sets of CUDA devices concurrently is therefore not guaranteed to be safe.

That warning scared me: it would mean replacing all MPI device communications with NCCL, and what if some of them come from a third-party library?

--Junchao Zhang
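For concreteness, the pattern under discussion (bootstrapping a NCCL communicator from an existing MPI communicator, then keeping NCCL and MPI transfers from overlapping on the same devices) looks roughly like the sketch below. It is illustrative only, not PETSc code: one GPU per rank is assumed, the four-GPUs-per-node mapping is made up, and error checking is omitted.

#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

/* Sketch: NCCL bootstrapped from MPI, one GPU per rank, no error checking. */
int main(int argc, char **argv)
{
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  cudaSetDevice(rank % 4);                 /* assumes 4 GPUs per node */

  /* Bootstrap a NCCL communicator over the existing MPI communicator. */
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
  ncclComm_t nccl_comm;
  ncclCommInitRank(&nccl_comm, size, id, rank);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  const size_t n = 1 << 20;
  double *d_buf;
  cudaMalloc((void **)&d_buf, n * sizeof(double));
  cudaMemset(d_buf, 0, n * sizeof(double));

  /* NCCL transfer, asynchronous on 'stream'. */
  ncclAllReduce(d_buf, d_buf, n, ncclDouble, ncclSum, nccl_comm, stream);

  /* This is the serialization the NCCL docs call for: make sure the NCCL
     work has finished before any MPI transfer touches the same devices. */
  cudaStreamSynchronize(stream);

  /* Only now hand the same device pointer to (CUDA-aware) MPI. */
  MPI_Allreduce(MPI_IN_PLACE, d_buf, (int)n, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

  cudaFree(d_buf);
  cudaStreamDestroy(stream);
  ncclCommDestroy(nccl_comm);
  MPI_Finalize();
  return 0;
}

A stream-aware MPI, the topic mentioned below in connection with Barry's talk, would let the MPI operation be queued on the same stream instead of forcing the host-side synchronization.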
On Wed, Apr 17, 2024 at 8:27 AM Sreeram R Venkat <[email protected]> wrote:

Yes, I saw this paper, https://www.sciencedirect.com/science/article/abs/pii/S016781912100079X, which mentions it, and I heard in Barry's talk at SIAM PP this year about the need for stream-aware MPI, so I was wondering whether NCCL would be used in PETSc to do GPU-GPU communication.

On Wed, Apr 17, 2024, 7:58 AM Junchao Zhang <[email protected]> wrote:

What is your need? Do you mean using NCCL for the MPI communication?

On Wed, Apr 17, 2024, 7:51 AM Sreeram R Venkat <[email protected]> wrote:

Do you know if there are plans for NCCL support in PETSc?

On Tue, Apr 16, 2024, 10:41 PM Junchao Zhang <[email protected]> wrote:

Glad to hear you found a way. Did you use Frontera at TACC? If yes, I could have a try.

--Junchao Zhang

On Tue, Apr 16, 2024 at 8:35 PM Sreeram R Venkat <[email protected]> wrote:

I finally figured out a way to make it work. I had to build PETSc and my application with the (non-GPU-aware) Intel MPI and then, before running, switch to MVAPICH2-GDR. I'm not sure why that works, but it's the only way I've found to compile and run successfully without any errors about not having a GPU-aware MPI.

On Fri, Dec 8, 2023 at 5:30 PM Mark Adams <[email protected]> wrote:

You may need to set some environment variables. This can be system specific, so you might want to look at the docs or ask TACC how to run with GPU-aware MPI.

Mark

On Fri, Dec 8, 2023 at 5:17 PM Sreeram R Venkat <[email protected]> wrote:

Actually, when I compile my program with this build of PETSc and run, I still get the error:

PETSC ERROR: PETSc is configured with GPU support, but your MPI is not GPU-aware. For better performance, please use a GPU-aware MPI.

I have the mvapich2-gdr module loaded and MV2_USE_CUDA=1. Is there anything else I need to do?

Thanks,
Sreeram
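A quick way to check whether the MPI actually being run against accepts device pointers, independent of PETSc, is a small test along the lines of the sketch below (illustrative only; names are arbitrary and error handling is omitted).

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

/* Smoke test: hand a device pointer straight to MPI_Allreduce. */
int main(int argc, char **argv)
{
  int rank, size;
  double one = 1.0, sum = 0.0;
  double *d_val;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  cudaMalloc((void **)&d_val, sizeof(double));
  cudaMemcpy(d_val, &one, sizeof(double), cudaMemcpyHostToDevice);

  /* A GPU-aware MPI accepts the device buffer here; a non-GPU-aware MPI
     typically crashes or returns an error. */
  MPI_Allreduce(MPI_IN_PLACE, d_val, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  cudaMemcpy(&sum, d_val, sizeof(double), cudaMemcpyDeviceToHost);
  if (rank == 0) printf("sum = %g (expect %d)\n", sum, size);

  cudaFree(d_val);
  MPI_Finalize();
  return 0;
}

Compile it with the same mpicc you intend to run under (adding the CUDA include and library paths plus -lcudart if the wrapper does not already supply them) and run it on a couple of ranks; with MVAPICH2-GDR and MV2_USE_CUDA=1 it should print the rank count.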
On Fri, Dec 8, 2023 at 3:29 PM Sreeram R Venkat <[email protected]> wrote:

Thank you, changing to CUDA 11.4 fixed the issue. The mvapich2-gdr module didn't require CUDA 11.4 as a dependency, so I was using 12.0.

On Fri, Dec 8, 2023 at 1:15 PM Satish Balay <[email protected]> wrote:

Executing: mpicc -show
stdout: icc -I/opt/apps/cuda/11.4/include -I/opt/apps/cuda/11.4/include -lcuda -L/opt/apps/cuda/11.4/lib64/stubs -L/opt/apps/cuda/11.4/lib64 -lcudart -lrt -Wl,-rpath,/opt/apps/cuda/11.4/lib64 -Wl,-rpath,XORIGIN/placeholder -Wl,--build-id -L/opt/apps/cuda/11.4/lib64/ -lm -I/opt/apps/intel19/mvapich2-gdr/2.3.7/include -L/opt/apps/intel19/mvapich2-gdr/2.3.7/lib64 -Wl,-rpath -Wl,/opt/apps/intel19/mvapich2-gdr/2.3.7/lib64 -Wl,--enable-new-dtags -lmpi

Checking for program /opt/apps/cuda/12.0/bin/nvcc...found

It looks like you are trying to mix two different CUDA versions in this build. Perhaps you need to use cuda-11.4 with this install of mvapich.

Satish

On Fri, 8 Dec 2023, Matthew Knepley wrote:

The proximate error is

Executing: nvcc -c -o /tmp/petsc-kn3f29gl/config.packages.cuda/conftest.o -I/tmp/petsc-kn3f29gl/config.setCompilers -I/tmp/petsc-kn3f29gl/config.types -I/tmp/petsc-kn3f29gl/config.packages.cuda -ccbin mpic++ -std=c++14 -Xcompiler -fPIC -Xcompiler -fvisibility=hidden -g -lineinfo -gencode arch=compute_80,code=sm_80 /tmp/petsc-kn3f29gl/config.packages.cuda/conftest.cu
stdout:
/opt/apps/cuda/11.4/include/crt/sm_80_rt.hpp(141): error: more than one instance of overloaded function "__nv_associate_access_property_impl" has "C" linkage
1 error detected in the compilation of "/tmp/petsc-kn3f29gl/config.packages.cuda/conftest.cu".
Possible ERROR while running compiler: exit code 1

This looks like screwed-up headers to me, but I will let someone who understands CUDA compilation reply.

Thanks,
Matt

On Fri, Dec 8, 2023 at 1:54 PM Sreeram R Venkat <[email protected]> wrote:

I am trying to build PETSc with CUDA using the CUDA-aware MVAPICH2-GDR. Here is my configure command:

./configure PETSC_ARCH=linux-c-debug-mvapich2-gdr --download-hypre --with-cuda=true --cuda-dir=$TACC_CUDA_DIR --with-hdf5=true --with-hdf5-dir=$TACC_PHDF5_DIR --download-elemental --download-metis --download-parmetis --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90

which errors with:

UNABLE to CONFIGURE with GIVEN OPTIONS (see configure.log for details):
---------------------------------------------------------------------------------------------
CUDA compile failed with arch flags " -ccbin mpic++ -std=c++14 -Xcompiler -fPIC -Xcompiler -fvisibility=hidden -g -lineinfo -gencode arch=compute_80,code=sm_80"
generated from "--with-cuda-arch=80"

The same configure command works when I use the Intel MPI, and I can build with CUDA. The full configure.log file is attached. Please let me know if you need any other information. I appreciate your help with this.

Thanks,
Sreeram
