Hi Anna,

Since you said "The code works with pc-type hypre on a single GPU.", I was wondering if this is a problem with the binding of CUDA devices to MPI ranks. You can search the TACC documentation to find out how its job scheduler binds GPUs to MPI ranks (usually by manipulating the CUDA_VISIBLE_DEVICES environment variable).
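As a quick check, here is a minimal sketch (my own illustration, assuming mpi4py is installed alongside petsc4py) that prints, for every MPI rank, which CUDA devices the environment exposes. Run it before petsc4py.init() so it reflects what PETSc/hypre will see; the commented-out override at the end is only an assumption and must be adjusted to the number of GPUs per node.

```python
# Minimal sketch (an assumption, not part of the original reply): report, per MPI rank,
# which CUDA devices are visible before PETSc/hypre initialize CUDA.
import os
from mpi4py import MPI

world = MPI.COMM_WORLD
# Node-local rank, obtained by splitting the communicator over shared memory.
local = world.Split_type(MPI.COMM_TYPE_SHARED).Get_rank()

print("world rank %d/%d, node-local rank %d, CUDA_VISIBLE_DEVICES=%s"
      % (world.Get_rank(), world.Get_size(), local,
         os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")))

# If every rank reports the same single device, one possible workaround (an assumption,
# adjust to the GPUs-per-node count) is to pin each local rank to its own device:
# os.environ["CUDA_VISIBLE_DEVICES"] = str(local)
```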
Please follow up if you cannot solve it. Thanks.

--Junchao Zhang

On Wed, Jan 31, 2024 at 4:07 PM Yesypenko, Anna <a...@oden.utexas.edu> wrote:

> Dear PETSc devs,
>
> I'm encountering an error running hypre on a single node with multiple GPUs.
> The issue is in the setup phase. I'm trying to troubleshoot, but don't know where to start.
> Are the system routines PetscCUDAInitialize and PetscCUDAInitializeCheck available in Python?
> How do I verify that GPUs are assigned properly to each MPI process? In this case, I have 3 tasks and 3 GPUs.
>
> The code works with pc-type hypre on a single GPU.
> Any suggestions are appreciated!
>
> Below is the error trace:
> ``
> TACC:  Starting up job 1490124
> TACC:  Setting up parallel environment for MVAPICH2+mpispawn.
> TACC:  Starting parallel tasks...
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
> [0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
> [0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
> [0]PETSC ERROR: The line numbers in the error traceback are not always exact.
> [0]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
> [0]PETSC ERROR: #2 MatBindToCPU_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1394
> [0]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1471
> [0]PETSC ERROR: #4 MatAssemblyEnd() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:5773
> [0]PETSC ERROR: #5 MatConvert_AIJ_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:660
> [0]PETSC ERROR: #6 MatConvert() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:4421
> [0]PETSC ERROR: #7 PCSetUp_HYPRE() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/impls/hypre/hypre.c:245
> [0]PETSC ERROR: #8 PCSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/interface/precon.c:1080
> [0]PETSC ERROR: #9 KSPSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:415
> [0]PETSC ERROR: #10 KSPSolve_Private() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:833
> [0]PETSC ERROR: #11 KSPSolve() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:1080
> application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
> ``
>
> Below is a minimal working example:
> ``
> import numpy, petsc4py, sys, time
> petsc4py.init(sys.argv)
> from petsc4py import PETSc
> from time import time
>
> n = int(5e5)
> comm = PETSc.COMM_WORLD
>
> # Assemble a distributed tridiagonal (1D Laplacian) matrix.
> pA = PETSc.Mat(comm=comm)
> pA.create(comm=comm)
> pA.setSizes((n, n))
> pA.setType(PETSc.Mat.Type.AIJ)
> pA.setPreallocationNNZ(3)
> rstart, rend = pA.getOwnershipRange()
>
> print("\t Processor %d of %d gets indices %d:%d" % (comm.Get_rank(), comm.Get_size(), rstart, rend))
> if rstart == 0:
>     pA.setValue(0, 0, 2); pA.setValue(0, 1, -1)
> if rend == n:
>     pA.setValue(n-1, n-2, -1); pA.setValue(n-1, n-1, 2)
>
> for index in range(rstart, rend):
>     if index > 0:
>         pA.setValue(index, index-1, -1)
>     pA.setValue(index, index, 2)
>     if index < n-1:
>         pA.setValue(index, index+1, -1)
>
> pA.assemble()
> # Convert to a GPU matrix format.
> pA = pA.convert(mat_type='aijcusparse')
>
> px, pb = pA.createVecs()
> pb.set(1.0); px.set(1.0)
>
> # CG with a hypre preconditioner.
> ksp = PETSc.KSP().create()
> ksp.setOperators(pA)
> ksp.setConvergenceHistory()
> ksp.setType('cg')
> ksp.getPC().setType('hypre')
> ksp.setTolerances(rtol=1e-10)
>
> ksp.solve(pb, px)  # the error is generated here
> ``
>
> Best,
> Anna