Hi Junchao,

Thank you for your suggestion; you're right that binding MPI ranks to GPUs seems to be the issue. I looked through the TACC documentation, and I'm not sure they provide a utility for this. Instead, I'm trying to set the CUDA_VISIBLE_DEVICES environment variable according to the MPI rank.
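For concreteness, the kind of per-rank mapping I mean looks like the sketch below (illustrative only, not my exact script; it assumes mpi4py is available, one MPI task per GPU, and that the variable is set before petsc4py is imported, i.e. before CUDA has been initialized):

``
# Sketch: map each MPI rank to one GPU by restricting CUDA_VISIBLE_DEVICES
# before petsc4py is imported, so that each rank only sees its chosen device
# once CUDA initializes. Assumes mpi4py and 3 GPUs (one task per GPU) on the node.
import os, sys
from mpi4py import MPI   # initializes MPI so the rank is available

rank = MPI.COMM_WORLD.Get_rank()
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank % 3)   # rank 0 -> GPU 0, rank 1 -> GPU 1, ...

import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc
``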
This approach works sometimes now! The environment variables are set properly, but it still fails with the same error about half the time. How can I check that hypre is binding MPI ranks to GPUs properly? The error originates from a call to hypre. I also tried to set the environment variable (using mpi4py) before importing PETSc, along the lines of the sketch above, but this doesn't seem to work either. Here is the preamble I added to the top of the script. I'm running on a single node with 3 GPUs.

``
import numpy,petsc4py,sys,os,time
from time import time
petsc4py.init(sys.argv)
from petsc4py import PETSc

comm = PETSc.COMM_WORLD
os.environ['CUDA_VISIBLE_DEVICES'] = "%d" % comm.Get_rank()
PETSc.Sys.syncPrint("\t Processor %d of %d gets GPU %d" % (comm.Get_rank(),comm.Get_size(),comm.Get_rank()), comm=comm, flush=True)
comm.Barrier()

### PETSc matrix initialization here ###
# I confirm that the matrix is partitioned into indices as I expect
PETSc.Sys.syncPrint("\t Processor %d with GPU %s gets indices %d:%d" % (comm.Get_rank(),os.environ['CUDA_VISIBLE_DEVICES'],rstart,rend), flush=True, comm=comm)
``

When the script fails, I get the following stack trace:

``
TACC: Starting up job 1491828
TACC: Setting up parallel environment for MVAPICH2+mpispawn.
TACC: Starting parallel tasks...
	 Processor 0 of 3 gets GPU 0
	 Processor 1 of 3 gets GPU 1
	 Processor 2 of 3 gets GPU 2
	 Processor 0 with GPU 0 gets indices 0:166667
	 Processor 1 with GPU 1 gets indices 166667:333334
	 Processor 2 with GPU 2 gets indices 333334:500000
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
[0]PETSC ERROR: The line numbers in the error traceback are not always exact.
[0]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
[0]PETSC ERROR: #2 MatBindToCPU_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1394
[0]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1471
[0]PETSC ERROR: #4 MatAssemblyEnd() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:5773
[0]PETSC ERROR: #5 MatConvert_AIJ_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:660
[0]PETSC ERROR: #6 MatConvert() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:4421
[0]PETSC ERROR: #7 PCSetUp_HYPRE() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/impls/hypre/hypre.c:245
[0]PETSC ERROR: #8 PCSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/interface/precon.c:1080
[0]PETSC ERROR: #9 KSPSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:415
[0]PETSC ERROR: #10 KSPSolve_Private() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:833
[0]PETSC ERROR: #11 KSPSolve() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:1080
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
``

________________________________
From: Junchao Zhang <junchao.zh...@gmail.com>
Sent: Wednesday, January 31, 2024 5:36 PM
To: Yesypenko, Anna <a...@oden.utexas.edu>
Cc: petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a node

Hi Anna,

Since you said "The code works with pc-type hypre on a single GPU.", I was wondering if this is a CUDA-device-to-MPI-rank binding problem. You can search the TACC documentation to find out how its job scheduler binds GPUs to MPI ranks (usually by manipulating the CUDA_VISIBLE_DEVICES environment variable). Please follow up if you cannot solve it. Thanks.

--Junchao Zhang


On Wed, Jan 31, 2024 at 4:07 PM Yesypenko, Anna <a...@oden.utexas.edu> wrote:

Dear PETSc devs,

I'm encountering an error running hypre on a single node with multiple GPUs; the issue is in the setup phase. I'm trying to troubleshoot, but I don't know where to start. Are the system routines PetscCUDAInitialize and PetscCUDAInitializeCheck available from Python? How do I verify that GPUs are assigned properly to each MPI process? In this case, I have 3 tasks and 3 GPUs. The code works with pc-type hypre on a single GPU. Any suggestions are appreciated!

Below is the error trace:

``
TACC: Starting up job 1490124
TACC: Setting up parallel environment for MVAPICH2+mpispawn.
TACC: Starting parallel tasks...
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
[0]PETSC ERROR: The line numbers in the error traceback are not always exact.
[0]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
[0]PETSC ERROR: #2 MatBindToCPU_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1394
[0]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1471
[0]PETSC ERROR: #4 MatAssemblyEnd() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:5773
[0]PETSC ERROR: #5 MatConvert_AIJ_HYPRE() at /work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:660
[0]PETSC ERROR: #6 MatConvert() at /work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:4421
[0]PETSC ERROR: #7 PCSetUp_HYPRE() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/impls/hypre/hypre.c:245
[0]PETSC ERROR: #8 PCSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/pc/interface/precon.c:1080
[0]PETSC ERROR: #9 KSPSetUp() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:415
[0]PETSC ERROR: #10 KSPSolve_Private() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:833
[0]PETSC ERROR: #11 KSPSolve() at /work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:1080
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
``

Below is a minimum working example:

``
import numpy,petsc4py,sys,time
petsc4py.init(sys.argv)
from petsc4py import PETSc
from time import time

n = int(5e5)
comm = PETSc.COMM_WORLD

pA = PETSc.Mat(comm=comm)
pA.create(comm=comm)
pA.setSizes((n,n))
pA.setType(PETSc.Mat.Type.AIJ)
pA.setPreallocationNNZ(3)

rstart,rend = pA.getOwnershipRange()
print("\t Processor %d of %d gets indices %d:%d" % (comm.Get_rank(),comm.Get_size(),rstart,rend))

if (rstart == 0): pA.setValue(0,0,2); pA.setValue(0,1,-1)
if (rend == n):   pA.setValue(n-1,n-2,-1); pA.setValue(n-1,n-1,2)
for index in range(rstart,rend):
    if (rstart > 0):
        pA.setValue(index,index-1,-1)
    pA.setValue(index,index,2)
    if (rend < n):
        pA.setValue(index,index+1,-1)
pA.assemble()
pA = pA.convert(mat_type='aijcusparse')

px,pb = pA.createVecs()
pb.set(1.0); px.set(1.0)

ksp = PETSc.KSP().create()
ksp.setOperators(pA)
ksp.setConvergenceHistory()
ksp.setType('cg')
ksp.getPC().setType('hypre')
ksp.setTolerances(rtol=1e-10)
ksp.solve(pb, px)  # error is generated here
``

Best,
Anna