Marcos,
  We do not have good petsc/gpu documentation, but see https://petsc.org/main/faq/#doc-faq-gpuhowto. You can also search for "requires: cuda" in the petsc tests to find examples that use GPUs. For the Fortran compile errors, attach your configure.log and Satish (Cc'ed) or others should know how to fix them.
Thanks.
--Junchao Zhang

On Fri, Aug 11, 2023 at 2:22 PM Vanella, Marcos (Fed) <marcos.vane...@nist.gov> wrote:

> Hi Junchao, thanks for the explanation. Is there some development
> documentation on the GPU work? I'm interested in learning about it.
> I checked out the main branch and configured petsc. When compiling with
> gcc/gfortran I come across this error:
>
> ....
>   CUDAC     arch-linux-c-opt/obj/src/mat/impls/aij/seq/seqcusparse/aijcusparse.o
>   CUDAC.dep arch-linux-c-opt/obj/src/mat/impls/aij/seq/seqcusparse/aijcusparse.o
>   FC        arch-linux-c-opt/obj/src/ksp/f90-mod/petsckspdefmod.o
>   FC        arch-linux-c-opt/obj/src/ksp/f90-mod/petscpcmod.o
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:37:61:
>
>    37 | subroutine PCASMCreateSubdomains2D(a,b,c,d,e,f,g,h,i,z)
>       |                                                         1
> *Error: Symbol ‘pcasmcreatesubdomains2d’ at (1) already has an explicit interface*
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:38:13:
>
>    38 | import tIS
>       |             1
> Error: IMPORT statement at (1) only permitted in an INTERFACE body
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:39:80:
>
>    39 | PetscInt a ! PetscInt
>       |                                                                                1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:40:80:
>
>    40 | PetscInt b ! PetscInt
>       |                                                                                1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:41:80:
>
>    41 | PetscInt c ! PetscInt
>       |                                                                                1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:42:80:
>
>    42 | PetscInt d ! PetscInt
>       |                                                                                1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:43:80:
>
>    43 | PetscInt e ! PetscInt
>       |                                                                                1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:44:80:
>
>    44 | PetscInt f ! PetscInt
>       |                                                                                1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:45:80:
>
>    45 | PetscInt g ! PetscInt
>       |                                                                                1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:46:30:
>
>    46 | IS h ! IS
>       |                              1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:47:30:
>
>    47 | IS i ! IS
>       |                              1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:48:43:
>
>    48 | PetscErrorCode z
>       |                                           1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:49:10:
>
>    49 | end subroutine PCASMCreateSubdomains2D
>       |          1
> Error: Expecting END INTERFACE statement at (1)
>
> make[3]: *** [gmakefile:225: arch-linux-c-opt/obj/src/ksp/f90-mod/petscpcmod.o] Error 1
> make[3]: *** Waiting for unfinished jobs....
>   CC        arch-linux-c-opt/obj/src/tao/leastsquares/impls/pounders/pounders.o
>   CC        arch-linux-c-opt/obj/src/ksp/pc/impls/bddc/bddcprivate.o
>   CUDAC     arch-linux-c-opt/obj/src/vec/vec/impls/seq/cupm/cuda/vecseqcupm.o
>   CUDAC.dep arch-linux-c-opt/obj/src/vec/vec/impls/seq/cupm/cuda/vecseqcupm.o
> make[3]: Leaving directory '/home/mnv/Software/petsc'
> make[2]: *** [/home/mnv/Software/petsc/lib/petsc/conf/rules.doc:28: libs] Error 2
> make[2]: Leaving directory '/home/mnv/Software/petsc'
> **************************ERROR*************************************
>   Error during compile, check arch-linux-c-opt/lib/petsc/conf/make.log
>   Send it and arch-linux-c-opt/lib/petsc/conf/configure.log to
>   petsc-ma...@mcs.anl.gov
> ********************************************************************
> make[1]: *** [makefile:45: all] Error 1
> make: *** [GNUmakefile:9: all] Error 2
>
> ------------------------------
> *From:* Junchao Zhang <junchao.zh...@gmail.com>
> *Sent:* Friday, August 11, 2023 3:04 PM
> *To:* Vanella, Marcos (Fed) <marcos.vane...@nist.gov>
> *Cc:* petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
> *Subject:* Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU
>
> Hi, Marcos,
>   I saw MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic() in the error stack.
> We recently refactored the COO code and got rid of that function. So could
> you try petsc/main?
>   We map MPI processes to GPUs in a round-robin fashion. We query the
> number of visible CUDA devices (g) and assign device (rank % g) to MPI
> process (rank). In that sense, the work distribution is entirely determined
> by your MPI work partition (i.e., yourself).
>   On clusters, this MPI-process-to-GPU binding is usually done by the job
> scheduler, e.g. slurm. You need to check your cluster's user guide to see
> how to bind MPI processes to GPUs.
> If the job scheduler has done that, the number of visible CUDA devices to a
> process might just appear to be 1, making petsc's own mapping void.
>
> Thanks.
> --Junchao Zhang
>
> On Fri, Aug 11, 2023 at 12:43 PM Vanella, Marcos (Fed) <marcos.vane...@nist.gov> wrote:
>
> Hi Junchao, thank you for replying. I compiled petsc in debug mode and
> this is what I get for the case:
>
> terminate called after throwing an instance of 'thrust::system::system_error'
>   what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress: an
> illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> #0 0x15264731ead0 in ???
> #1 0x15264731dc35 in ???
> #2 0x15264711551f in ???
> #3 0x152647169a7c in ???
> #4 0x152647115475 in ???
> #5 0x1526470fb7f2 in ???
> #6 0x152647678bbd in ???
> #7 0x15264768424b in ???
> #8 0x1526476842b6 in ???
> #9 0x152647684517 in ???
> #10 0x55bb46342ebb in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
>     at /usr/local/cuda/include/thrust/system/cuda/detail/util.h:224
> #11 0x55bb46342ebb in _ZN6thrust8cuda_cub12__merge_sort10merge_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_EEEENS3_15normal_iteratorISB_EE9IJCompareEEvRNS0_16execution_policyIT1_EET2_SM_T3_T4_
>     at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1316
> #12 0x55bb46342ebb in _ZN6thrust8cuda_cub12__smart_sort10smart_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_16execution_policyINS0_3tagEEENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESD_NS_9null_typeESE_SE_SE_SE_SE_SE_SE_EEEENS3_15normal_iteratorISD_EE9IJCompareEENS1_25enable_if_comparison_sortIT2_T4_E4typeERT1_SL_SL_T3_SM_
>     at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1544
> #13 0x55bb46342ebb in _ZN6thrust8cuda_cub11sort_by_keyINS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRNS0_16execution_policyIT_EET0_SI_T1_T2_
>     at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1669
> #14 0x55bb46317bc5 in _ZN6thrust11sort_by_keyINS_8cuda_cub3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_EEEENS_6detail15normal_iteratorIS6_EE9IJCompareEEvRKNSA_21execution_policy_baseIT_EET0_SJ_T1_T2_
>     at /usr/local/cuda/include/thrust/detail/sort.inl:115
> #15 0x55bb46317bc5 in _ZN6thrust11sort_by_keyINS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES4_NS_9null_typeES5_S5_S5_S5_S5_S5_S5_EEEENS_6detail15normal_iteratorIS4_EE9IJCompareEEvT_SC_T0_T1_
>     at /usr/local/cuda/include/thrust/detail/sort.inl:305
> #16 0x55bb46317bc5 in MatSetPreallocationCOO_SeqAIJCUSPARSE_Basic
>     at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu:4452
> #17 0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic
>     at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:173
> #18 0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE
>     at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:222
> #19 0x55bb468e01cf in MatSetPreallocationCOO
>     at /home/mnv/Software/petsc/src/mat/utils/gcreate.c:606
> #20 0x55bb46b39c9b in MatProductSymbolic_MPIAIJBACKEND
>     at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7547
> #21 0x55bb469015e5 in MatProductSymbolic
>     at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:803
> #22 0x55bb4694ade2 in MatPtAP
>     at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9897
> #23 0x55bb4696d3ec in MatCoarsenApply_MISK_private
>     at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283
> #24 0x55bb4696eb67 in MatCoarsenApply_MISK
>     at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368
> #25 0x55bb4695bd91 in MatCoarsenApply
>     at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97
> #26 0x55bb478294d8 in PCGAMGCoarsen_AGG
>     at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524
> #27 0x55bb471d1cb4 in PCSetUp_GAMG
>     at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631
> #28 0x55bb464022cf in PCSetUp
>     at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:994
> #29 0x55bb4718b8a7 in KSPSetUp
>     at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:406
> #30 0x55bb4718f22e in KSPSolve_Private
>     at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:824
> #31 0x55bb47192c0c in KSPSolve
>     at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1070
> #32 0x55bb463efd35 in kspsolve_
>     at /home/mnv/Software/petsc/src/ksp/ksp/interface/ftn-auto/itfuncf.c:320
> #33 0x55bb45e94b32 in ???
> #34 0x55bb46048044 in ???
> #35 0x55bb46052ea1 in ???
> #36 0x55bb45ac5f8e in ???
> #37 0x1526470fcd8f in ???
> #38 0x1526470fce3f in ???
> #39 0x55bb45aef55d in ???
> #40 0xffffffffffffffff in ???
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 1771753 on node dgx02 exited
> on signal 6 (Aborted).
> --------------------------------------------------------------------------
>
> BTW, I'm curious: if I set n MPI processes, each of them building a part
> of the linear system, and g GPUs, how does PETSc distribute those n pieces
> of the system matrix and rhs among the g GPUs? Does it use some load-balancing
> algorithm? Where can I read about this?
>
> Thank you and best regards. I can also point you to my code repo on GitHub
> if you want to take a closer look.
> Best Regards,
> Marcos
>
> ------------------------------
> *From:* Junchao Zhang <junchao.zh...@gmail.com>
> *Sent:* Friday, August 11, 2023 10:52 AM
> *To:* Vanella, Marcos (Fed) <marcos.vane...@nist.gov>
> *Cc:* petsc-users@mcs.anl.gov <petsc-users@mcs.anl.gov>
> *Subject:* Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU
>
> Hi, Marcos,
>   Could you build petsc in debug mode and then copy and paste the whole
> error stack message?
>
> Thanks
> --Junchao Zhang
>
> On Thu, Aug 10, 2023 at 5:51 PM Vanella, Marcos (Fed) via petsc-users <petsc-users@mcs.anl.gov> wrote:
>
> Hi, I'm trying to run a parallel matrix-vector build and linear solution
> with PETSc on 2 MPI processes + one V100 GPU. I tested that the matrix
> build and solution are successful on CPUs only. I'm using CUDA 11.5,
> CUDA-enabled OpenMPI, and gcc 9.3. When I run the job with GPU enabled I
> get the following error:
>
> terminate called after throwing an instance of 'thrust::system::system_error'
>   *what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered*
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> terminate called after throwing an instance of 'thrust::system::system_error'
>   what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> I'm new to submitting jobs in slurm that also use GPU resources, so I
> might be doing something wrong in my submission script.
> This is it:
>
> #!/bin/bash
> #SBATCH -J test
> #SBATCH -e /home/Issues/PETSc/test.err
> #SBATCH -o /home/Issues/PETSc/test.log
> #SBATCH --partition=batch
> #SBATCH --ntasks=2
> #SBATCH --nodes=1
> #SBATCH --cpus-per-task=1
> #SBATCH --ntasks-per-node=2
> #SBATCH --time=01:00:00
> #SBATCH --gres=gpu:1
>
> export OMP_NUM_THREADS=1
> module load cuda/11.5
> module load openmpi/4.1.1
>
> cd /home/Issues/PETSc
> mpirun -n 2 /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg
>
> If anyone has any suggestions on how to troubleshoot this, please let me know.
> Thanks!
> Marcos