Also, try "-use_gpu_aware_mpi 0" to see if there is a difference.
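For reference, a minimal sketch of how that option could be passed on the command line. The launcher and executable name (`srun -n 64 ./poisson3d`) are placeholders inferred from the submit script and error log, not the exact command used; the other options are taken from the option table in the log.

```shell
# Disable GPU-aware MPI in PETSc's communication layer (PetscSF) to test
# whether the cuIpcOpenMemHandle/CUDA IPC failure is tied to GPU-aware MPI.
# Launcher and executable below are assumptions; keep your existing options
# and only append -use_gpu_aware_mpi 0.
srun -n 64 ./poisson3d \
    -dm_mat_type aijcusparse -dm_vec_type cuda \
    -pc_type gamg -ksp_type cg \
    -use_gpu_aware_mpi 0
```

With this flag PETSc stages GPU buffers through host memory for MPI communication, so if the crash disappears it points at the GPU-aware MPI path rather than the solver setup itself.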
--Junchao Zhang

On Thu, Feb 10, 2022 at 1:40 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:

> Did it fail without GPU at 64 MPI ranks?
>
> --Junchao Zhang
>
> On Thu, Feb 10, 2022 at 1:22 PM Sajid Ali Syed <sas...@fnal.gov> wrote:
>
>> Hi PETSc-developers,
>>
>> I'm seeing the following crash during the setup phase of the
>> preconditioner when using multiple GPUs. The relevant error trace is
>> shown below:
>>
>> (GTL DEBUG: 26) cuIpcOpenMemHandle: resource already mapped, CUDA_ERROR_ALREADY_MAPPED, line no 272
>> [24]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>> [24]PETSC ERROR: General MPI error
>> [24]PETSC ERROR: MPI error 1 Invalid buffer pointer
>> [24]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
>> [24]PETSC ERROR: Petsc Development GIT revision: f351d5494b5462f62c419e00645ac2e477b88cae  GIT Date: 2022-02-08 15:08:19 +0000
>> ...
>> [24]PETSC ERROR: #1 PetscSFLinkWaitRequests_MPI() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/impls/basic/sfmpi.c:54
>> [24]PETSC ERROR: #2 PetscSFLinkFinishCommunication() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/include/../src/vec/is/sf/impls/basic/sfpack.h:274
>> [24]PETSC ERROR: #3 PetscSFBcastEnd_Basic() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/impls/basic/sfbasic.c:218
>> [24]PETSC ERROR: #4 PetscSFBcastEnd() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/sf.c:1499
>> [24]PETSC ERROR: #5 VecScatterEnd_Internal() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/vscat.c:87
>> [24]PETSC ERROR: #6 VecScatterEnd() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/vec/is/sf/interface/vscat.c:1366
>> [24]PETSC ERROR: #7 MatMult_MPIAIJCUSPARSE() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:302
>> [24]PETSC ERROR: #8 MatMult() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/mat/interface/matrix.c:2438
>> [24]PETSC ERROR: #9 PCApplyBAorAB() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/interface/precon.c:730
>> [24]PETSC ERROR: #10 KSP_PCApplyBAorAB() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/include/petsc/private/kspimpl.h:421
>> [24]PETSC ERROR: #11 KSPGMRESCycle() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/impls/gmres/gmres.c:162
>> [24]PETSC ERROR: #12 KSPSolve_GMRES() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/impls/gmres/gmres.c:247
>> [24]PETSC ERROR: #13 KSPSolve_Private() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:925
>> [24]PETSC ERROR: #14 KSPSolve() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:1103
>> [24]PETSC ERROR: #15 PCGAMGOptProlongator_AGG() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/impls/gamg/agg.c:1127
>> [24]PETSC ERROR: #16 PCSetUp_GAMG() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/impls/gamg/gamg.c:626
>> [24]PETSC ERROR: #17 PCSetUp() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/pc/interface/precon.c:1017
>> [24]PETSC ERROR: #18 KSPSetUp() at /tmp/sajid/spack-stage/spack-stage-petsc-main-mnj56kbexro3fipf6kheyttljzwss7fo/spack-src/src/ksp/ksp/interface/itfunc.c:417
>> [24]PETSC ERROR: #19 main() at poisson3d.c:69
>> [24]PETSC ERROR: PETSc Option Table entries:
>> [24]PETSC ERROR: -dm_mat_type aijcusparse
>> [24]PETSC ERROR: -dm_vec_type cuda
>> [24]PETSC ERROR: -ksp_monitor
>> [24]PETSC ERROR: -ksp_norm_type unpreconditioned
>> [24]PETSC ERROR: -ksp_type cg
>> [24]PETSC ERROR: -ksp_view
>> [24]PETSC ERROR: -log_view
>> [24]PETSC ERROR: -mg_levels_esteig_ksp_type cg
>> [24]PETSC ERROR: -mg_levels_ksp_type chebyshev
>> [24]PETSC ERROR: -mg_levels_pc_type jacobi
>> [24]PETSC ERROR: -pc_gamg_agg_nsmooths 1
>> [24]PETSC ERROR: -pc_gamg_square_graph 1
>> [24]PETSC ERROR: -pc_gamg_threshold 0.0
>> [24]PETSC ERROR: -pc_gamg_threshold_scale 0.0
>> [24]PETSC ERROR: -pc_gamg_type agg
>> [24]PETSC ERROR: -pc_type gamg
>> [24]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-ma...@mcs.anl.gov----------
>>
>> Attached to this email are the full error log and the submit script for
>> an 8-node/64-GPU/64-MPI-rank job. I'll also note that the same program
>> did not crash when using either 2 or 4 nodes (with 8 & 16 GPUs/MPI
>> ranks, respectively); I've attached those logs as well in case they
>> help. Could someone let me know what this error means and what can be
>> done to prevent it?
>>
>> Thank you,
>> Sajid Ali (he/him) | Research Associate
>>
>> Scientific Computing Division
>>
>> Fermi National Accelerator Laboratory
>>
>> s-sajid-ali.github.io