Hi Anna,

Could you attach your PETSc configure.log?

--Junchao Zhang
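P.S. In case it helps while you gather that: one common pattern is to pin each rank to a GPU before PETSc (and therefore CUDA) is initialized. Below is only a rough sketch, assuming a single node where the number of ranks equals the number of GPUs; the MV2_COMM_WORLD_LOCAL_RANK fallback is MVAPICH2-specific and may not apply to your setup.

``
# Sketch only: select the GPU from the (local) MPI rank *before* petsc4py.init().
import os, sys
from mpi4py import MPI                      # MPI_Init happens on this import

local_rank = int(os.environ.get('MV2_COMM_WORLD_LOCAL_RANK',
                                MPI.COMM_WORLD.Get_rank()))
os.environ['CUDA_VISIBLE_DEVICES'] = str(local_rank)

import petsc4py
petsc4py.init(sys.argv)                     # each rank now sees a single device
from petsc4py import PETSc
``

If the CUDA-aware MPI already touches the GPU during MPI_Init, setting the variable from Python may still be too late; exporting CUDA_VISIBLE_DEVICES in the job script, before the Python interpreter starts, avoids that ordering problem.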
On Thu, Feb 1, 2024 at 9:28 AM Yesypenko, Anna <a...@oden.utexas.edu> wrote:

> Hi Junchao,
>
> Thank you for your suggestion; you're right that binding MPI ranks to GPUs seems to be the issue.
> I looked at the TACC documentation, and I'm not sure they provide this utility.
> I'm trying to set the CUDA_VISIBLE_DEVICES environment variable according to the MPI rank.
>
> This works sometimes now! The environment variables are set properly, but it still fails with the same error half the time.
> How do I know that hypre is binding MPI ranks to GPUs properly? The error originates from a call to hypre.
>
> I also tried to set the environment variable (using mpi4py) before importing PETSc, but this doesn't seem to work either.
>
> Here is the preamble I added to the top of the script. I'm running on a single node with 3 GPUs.
> ``
> import numpy,petsc4py,sys,os,time
> from time import time
> petsc4py.init(sys.argv)
> from petsc4py import PETSc
>
> comm = PETSc.COMM_WORLD
>
> os.environ['CUDA_VISIBLE_DEVICES'] = "%d" % comm.Get_rank()
> PETSc.Sys.syncPrint("\t Processor %d of %d gets GPU %d"%(comm.Get_rank(),comm.Get_size(),comm.Get_rank()),comm=comm,flush=True)
> comm.Barrier()
>
> ### PETSc matrix initialization here
>
> ### I confirm that the matrix is partitioned into indices as I expect
> PETSc.Sys.syncPrint("\t Processor %d with GPU %s gets indices %d:%d"%(comm.Get_rank(),os.environ['CUDA_VISIBLE_DEVICES'],rstart,rend),flush=True,comm=comm)
> ``
>
> When the script fails, I get the following stack trace.
> ``
> TACC: Starting up job 1491828
> TACC: Setting up parallel environment for MVAPICH2+mpispawn.
> TACC: Starting parallel tasks...
>     Processor 0 of 3 gets GPU 0
>     Processor 1 of 3 gets GPU 1
>     Processor 2 of 3 gets GPU 2
>     Processor 0 with GPU 0 gets indices 0:166667
>     Processor 1 with GPU 1 gets indices 166667:333334
>     Processor 2 with GPU 2 gets indices 333334:500000
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
> [0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
> [0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
> [0]PETSC ERROR: The line numbers in the error traceback are not always exact.
> [0]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
> [0]PETSC ERROR: #2 MatBindToCPU_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1394
> [0]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1471
> [0]PETSC ERROR: #4 MatAssemblyEnd() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:5773
> [0]PETSC ERROR: #5 MatConvert_AIJ_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:660
> [0]PETSC ERROR: #6 MatConvert() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:4421
> [0]PETSC ERROR: #7 PCSetUp_HYPRE() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/impls/hypre/hypre.c:245
> [0]PETSC ERROR: #8 PCSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/interface/precon.c:1080
> [0]PETSC ERROR: #9 KSPSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:415
> [0]PETSC ERROR: #10 KSPSolve_Private() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:833
> [0]PETSC ERROR: #11 KSPSolve() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:1080
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
> ``
>
> ------------------------------
> *From:* Junchao Zhang <junchao.zh...@gmail.com>
> *Sent:* Wednesday, January 31, 2024 5:36 PM
> *To:* Yesypenko, Anna <a...@oden.utexas.edu>
> *Cc:* petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
> *Subject:* Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node
>
> Hi Anna,
> Since you said "The code works with pc-type hypre on a single GPU.", I was wondering if this is a CUDA-devices-to-MPI-ranks binding problem.
> You can search the TACC documentation to find how its job scheduler binds GPUs to MPI ranks (usually by manipulating the CUDA_VISIBLE_DEVICES environment variable).
>
> Please follow up if you could not solve it.
>
> Thanks.
> --Junchao Zhang
>
>
> On Wed, Jan 31, 2024 at 4:07 PM Yesypenko, Anna <a...@oden.utexas.edu> wrote:
>
> Dear Petsc devs,
>
> I'm encountering an error running hypre on a single node with multiple GPUs.
> The issue is in the setup phase. I'm trying to troubleshoot, but I don't know where to start.
> Are the system routines PetscCUDAInitialize and PetscCUDAInitializeCheck available in Python?
> How do I verify that GPUs are assigned properly to each MPI process? In this case, I have 3 tasks and 3 GPUs.
>
> The code works with pc-type hypre on a single GPU.
> Any suggestions are appreciated!
>
> Below is the error trace:
> ``
> TACC: Starting up job 1490124
> TACC: Setting up parallel environment for MVAPICH2+mpispawn.
> TACC: Starting parallel tasks...
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
> [0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
> [0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
> [0]PETSC ERROR: The line numbers in the error traceback are not always exact.
> [0]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
> [0]PETSC ERROR: #2 MatBindToCPU_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1394
> [0]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1471
> [0]PETSC ERROR: #4 MatAssemblyEnd() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:5773
> [0]PETSC ERROR: #5 MatConvert_AIJ_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:660
> [0]PETSC ERROR: #6 MatConvert() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:4421
> [0]PETSC ERROR: #7 PCSetUp_HYPRE() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/impls/hypre/hypre.c:245
> [0]PETSC ERROR: #8 PCSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/interface/precon.c:1080
> [0]PETSC ERROR: #9 KSPSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:415
> [0]PETSC ERROR: #10 KSPSolve_Private() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:833
> [0]PETSC ERROR: #11 KSPSolve() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:1080
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
> ``
>
> Below is a minimal working example:
> ``
> import numpy,petsc4py,sys,time
> petsc4py.init(sys.argv)
> from petsc4py import PETSc
> from time import time
>
> n = int(5e5)
> comm = PETSc.COMM_WORLD
>
> # assemble the tridiagonal 1D Laplacian stencil [-1, 2, -1]
> pA = PETSc.Mat(comm=comm)
> pA.create(comm=comm)
> pA.setSizes((n,n))
> pA.setType(PETSc.Mat.Type.AIJ)
> pA.setPreallocationNNZ(3)
> rstart,rend = pA.getOwnershipRange()
>
> print("\t Processor %d of %d gets indices %d:%d"%(comm.Get_rank(),comm.Get_size(),rstart,rend))
> if (rstart == 0):
>     pA.setValue(0,0,2); pA.setValue(0,1,-1)
> if (rend == n):
>     pA.setValue(n-1,n-2,-1); pA.setValue(n-1,n-1,2)
>
> for index in range(rstart,rend):
>     if (index > 0):
>         pA.setValue(index,index-1,-1)
>     pA.setValue(index,index,2)
>     if (index < n-1):
>         pA.setValue(index,index+1,-1)
>
> pA.assemble()
> pA = pA.convert(mat_type='aijcusparse')
>
> px,pb = pA.createVecs()
> pb.set(1.0); px.set(1.0)
>
> ksp = PETSc.KSP().create()
> ksp.setOperators(pA)
> ksp.setConvergenceHistory()
> ksp.setType('cg')
> ksp.getPC().setType('hypre')
> ksp.setTolerances(rtol=1e-10)
>
> ksp.solve(pb, px)   # the error is generated here
> ``
>
> Best,
> Anna
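One more quick check that might help narrow it down: print which physical GPU each rank actually ends up with once CUDA_VISIBLE_DEVICES is set. This is only a sketch and assumes CuPy happens to be available in your Python environment (any package that can query the CUDA runtime would do):

``
# Hypothetical check, run after CUDA_VISIBLE_DEVICES has been set on each rank:
# every rank should see exactly one device, and the PCI bus IDs should all differ.
import os
from mpi4py import MPI
import cupy

rank = MPI.COMM_WORLD.Get_rank()
ndev = cupy.cuda.runtime.getDeviceCount()
bus = cupy.cuda.Device(0).pci_bus_id
print("rank %d: CUDA_VISIBLE_DEVICES=%s, %d visible device(s), PCI bus %s"
      % (rank, os.environ.get('CUDA_VISIBLE_DEVICES'), ndev, bus))
``

If all ranks report the same bus ID, the binding is not taking effect; if the IDs differ and the segfault still appears, the problem is more likely on the hypre / GPU-aware MPI side than in the device assignment, and the configure.log will tell us more.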