Hi Junchao,

Thank you for your suggestion; you're right that binding MPI ranks to GPUs 
seems to be the issue.
I looked at the TACC documentation, and I'm not sure they provide a utility for this binding.
I'm trying to set the CUDA_VISIBLE_DEVICES environment variable according to 
the MPI rank.

This sometimes works now! The environment variables are set properly, but the 
script still fails with the same error about half the time.
How can I check whether hypre is binding MPI ranks to GPUs properly?  The error 
originates from a call to hypre.

I also tried to set the environment variable (using mpi4py) before importing 
PETSc, but this doesn't seem to work either.
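For reference, here is a minimal sketch of that mpi4py attempt (assuming a single 
node, so the global rank can stand in for the local rank):
``
# Sketch: set CUDA_VISIBLE_DEVICES from the MPI rank *before* petsc4py/PETSc
# is imported and initialized; mpi4py supplies the rank.
import os, sys
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()                # one rank per GPU on this node
os.environ['CUDA_VISIBLE_DEVICES'] = str(rank)

import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc
``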

Here is the preamble currently at the top of the script (the version that sometimes 
works). I'm running on a single node with 3 GPUs.
``
import numpy,petsc4py,sys,os,time
from time import time
petsc4py.init(sys.argv)
from petsc4py import PETSc

comm  = PETSc.COMM_WORLD

os.environ['CUDA_VISIBLE_DEVICES'] = "%d" % comm.Get_rank()
PETSc.Sys.syncPrint("\t Processor %d of %d gets GPU %d"%\
        (comm.Get_rank(),comm.Get_size(),comm.Get_rank()),comm=comm,flush=True)
comm.Barrier()

### Petsc Matrix initialization here

### I confirm that the matrix is partitioned into indices as I expect
PETSc.Sys.syncPrint("\t Processor %d with GPU %s gets indices %d:%d"%\
        (comm.Get_rank(),os.environ['CUDA_VISIBLE_DEVICES'],rstart,rend),flush=True,comm=comm)
``

When the script fails, I get the following stack trace.
``
TACC:  Starting up job 1491828
TACC:  Setting up parallel environment for MVAPICH2+mpispawn.
TACC:  Starting parallel tasks...
       Processor 0 of 3 gets GPU 0
       Processor 1 of 3 gets GPU 1
       Processor 2 of 3 gets GPU 2
       Processor 0 with GPU 0 gets indices 0:166667
       Processor 1 with GPU 1 gets indices 166667:333334
       Processor 2 with GPU 2 gets indices 333334:500000
[0]PETSC ERROR: 
------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably 
memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and 
https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on 
NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: ---------------------  Stack Frames 
------------------------------------
[0]PETSC ERROR: The line numbers in the error traceback are not always exact.
[0]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
[0]PETSC ERROR: #2 MatBindToCPU_HYPRE() at 
/work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1394
[0]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at 
/work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1471
[0]PETSC ERROR: #4 MatAssemblyEnd() at 
/work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:5773
[0]PETSC ERROR: #5 MatConvert_AIJ_HYPRE() at 
/work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:660
[0]PETSC ERROR: #6 MatConvert() at 
/work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:4421
[0]PETSC ERROR: #7 PCSetUp_HYPRE() at 
/work/06368/annayesy/ls6/petsc/src/ksp/pc/impls/hypre/hypre.c:245
[0]PETSC ERROR: #8 PCSetUp() at 
/work/06368/annayesy/ls6/petsc/src/ksp/pc/interface/precon.c:1080
[0]PETSC ERROR: #9 KSPSetUp() at 
/work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:415
[0]PETSC ERROR: #10 KSPSolve_Private() at 
/work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:833
[0]PETSC ERROR: #11 KSPSolve() at 
/work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:1080
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
``

________________________________
From: Junchao Zhang <junchao.zh...@gmail.com>
Sent: Wednesday, January 31, 2024 5:36 PM
To: Yesypenko, Anna <a...@oden.utexas.edu>
Cc: petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] errors with hypre with MPI and multiple GPUs on a 
node

Hi Anna,
  Since you said "The code works with pc-type hypre on a single GPU.", I was 
wondering if this is a problem with binding CUDA devices to MPI ranks.
  You can search the TACC documentation to find out how its job scheduler binds GPUs to 
MPI ranks (usually by manipulating the CUDA_VISIBLE_DEVICES environment 
variable).

  Please follow up if you could not solve it.

  Thanks.
--Junchao Zhang


On Wed, Jan 31, 2024 at 4:07 PM Yesypenko, Anna 
<a...@oden.utexas.edu> wrote:
Dear PETSc devs,

I'm encountering an error running hypre on a single node with multiple GPUs.
The issue is in the setup phase. I'm trying to troubleshoot, but I don't know 
where to start.
Are the system routines PetscCUDAInitialize and PetscCUDAInitializeCheck 
available in Python?
How do I verify that GPUs are assigned properly to each MPI process? In this 
case, I have 3 tasks and 3 GPUs.

The code works with pc-type hypre on a single GPU.
Any suggestions are appreciated!

Below is the error trace:
``
TACC:  Starting up job 1490124
TACC:  Setting up parallel environment for MVAPICH2+mpispawn.
TACC:  Starting parallel tasks...
[0]PETSC ERROR: 
------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably 
memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and 
https://petsc.org/release/faq/
[0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on 
NVIDIA CUDA systems to find memory corruption errors
[0]PETSC ERROR: ---------------------  Stack Frames 
------------------------------------
[0]PETSC ERROR: The line numbers in the error traceback are not always exact.
[0]PETSC ERROR: #1 hypre_ParCSRMatrixMigrate()
[0]PETSC ERROR: #2 MatBindToCPU_HYPRE() at 
/work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1394
[0]PETSC ERROR: #3 MatAssemblyEnd_HYPRE() at 
/work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:1471
[0]PETSC ERROR: #4 MatAssemblyEnd() at 
/work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:5773
[0]PETSC ERROR: #5 MatConvert_AIJ_HYPRE() at 
/work/06368/annayesy/ls6/petsc/src/mat/impls/hypre/mhypre.c:660
[0]PETSC ERROR: #6 MatConvert() at 
/work/06368/annayesy/ls6/petsc/src/mat/interface/matrix.c:4421
[0]PETSC ERROR: #7 PCSetUp_HYPRE() at 
/work/06368/annayesy/ls6/petsc/src/ksp/pc/impls/hypre/hypre.c:245
[0]PETSC ERROR: #8 PCSetUp() at 
/work/06368/annayesy/ls6/petsc/src/ksp/pc/interface/precon.c:1080
[0]PETSC ERROR: #9 KSPSetUp() at 
/work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:415
[0]PETSC ERROR: #10 KSPSolve_Private() at 
/work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:833
[0]PETSC ERROR: #11 KSPSolve() at 
/work/06368/annayesy/ls6/petsc/src/ksp/ksp/interface/itfunc.c:1080
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0
``

Below is a minimum working example:
``
import numpy,petsc4py,sys,time
petsc4py.init(sys.argv)
from petsc4py import PETSc
from time import time

n     = int(5e5);
comm  = PETSc.COMM_WORLD

pA = PETSc.Mat(comm=comm)
pA.create(comm=comm)
pA.setSizes((n,n))
pA.setType(PETSc.Mat.Type.AIJ)
pA.setPreallocationNNZ(3)
rstart,rend=pA.getOwnershipRange()

print("\t Processor %d of %d gets indices 
%d:%d"%(comm.Get_rank(),comm.Get_size(),rstart,rend))
if (rstart == 0):
    pA.setValue(0,0,2); pA.setValue(0,1,-1)
if (rend == n):
    pA.setValue(n-1,n-2,-1); pA.setValue(n-1,n-1,2)

for index in range(rstart,rend):
    # decide boundary rows by the row index itself, not by the rank's ownership range
    if (index > 0):
        pA.setValue(index,index-1,-1)
    pA.setValue(index,index,2)
    if (index < n-1):
        pA.setValue(index,index+1,-1)

pA.assemble()
pA = pA.convert(mat_type='aijcusparse')

px,pb = pA.createVecs()
pb.set(1.0); px.set(1.0)

ksp = PETSc.KSP().create()
ksp.setOperators(pA)
ksp.setConvergenceHistory()
ksp.setType('cg')
ksp.getPC().setType('hypre')
ksp.setTolerances(rtol=1e-10)

ksp.solve(pb, px)                           # error is generated here
``

Best,
Anna
