You want: -mat_type aijhipsparse

On Tue, Mar 19, 2024 at 5:06 PM Vanella, Marcos (Fed) <[email protected]> wrote:
> Hi Mark, thanks. I'll try your suggestions. So, I would keep -mat_type
> mpiaijkokkos but use -vec_type hip as runtime options?
> Thanks,
> Marcos
> ------------------------------
> From: Mark Adams <[email protected]>
> Sent: Tuesday, March 19, 2024 4:57 PM
> To: Vanella, Marcos (Fed) <[email protected]>
> Cc: PETSc users list <[email protected]>
> Subject: Re: [petsc-users] Running CG with HYPRE AMG preconditioner in AMD GPUs
>
> [keep on list]
>
> I have little experience with running hypre on GPUs but others might have more.
>
> 1M dofs/node is not a lot, and NVIDIA has a larger L1 cache and more mature
> compilers, etc., so it is not surprising that NVIDIA is faster.
> I suspect the gap would narrow with a larger problem.
>
> Also, why are you using Kokkos? It should not make a difference, but you
> could check easily. Just use -vec_type hip with your current code.
>
> You could also test with GAMG, -pc_type gamg
>
> Mark
>
>
> On Tue, Mar 19, 2024 at 4:12 PM Vanella, Marcos (Fed) <[email protected]> wrote:
>
> Hi Mark, I ran a canonical test we have to time our code. It is a propane
> fire on a burner within a box with around 1 million cells.
> I split the problem across 4 GPUs, single node, on both Polaris and Frontier.
> I compiled PETSc with the gnu compilers, downloading HYPRE, and the following
> configure options:
>
>   - Polaris:
>     $ ./configure COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3"
>       FCOPTFLAGS="-O3" CUDAOPTFLAGS="-O3" --with-debugging=0
>       --download-suitesparse --download-hypre --with-cuda --with-cc=cc
>       --with-cxx=CC --with-fc=ftn --with-cudac=nvcc --with-cuda-arch=80
>       --download-cmake
>
>   - Frontier:
>     $ ./configure COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3"
>       FCOPTFLAGS="-O3" HIPOPTFLAGS="-O3" --with-debugging=0 --with-cc=cc
>       --with-cxx=CC --with-fc=ftn --with-hip --with-hipc=hipcc
>       --LIBS="-L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a}
>       ${PE_MPICH_GTL_LIBS_amd_gfx90a}" --download-kokkos
>       --download-kokkos-kernels --download-suitesparse --download-hypre
>       --download-cmake
>
> Our code was also compiled with gnu compilers and the -O3 flag. I used the
> latest (from this week) PETSc repo update. These are the timings for the
> test case:
>
>   - 8 meshes + 1 million cells case, 8 MPI processes, 4 GPUs, 2 MPI
>     processes per GPU, 1 sec run time (~580 time steps, ~1160 Poisson solves):
>
> System     Poisson Solver   GPU Implementation   Poisson Wall time (sec)   Total Wall time (sec)
> Polaris    CG + HYPRE PC    CUDA                  80                       287
> Frontier   CG + HYPRE PC    Kokkos + HIP         158                       401
>
> It is interesting to see that the Poisson solves take twice as long on
> Frontier as on Polaris.
> Do you have experience running HYPRE AMG on these machines? Is this
> difference between the CUDA implementation and Kokkos-kernels to be expected?
>
> I can run the case on both computers with the log flags you suggest. That
> might give more information on where the differences are.
>
> Thank you for your time,
> Marcos
>
> ------------------------------
> From: Mark Adams <[email protected]>
> Sent: Tuesday, March 5, 2024 2:41 PM
> To: Vanella, Marcos (Fed) <[email protected]>
> Cc: [email protected] <[email protected]>
> Subject: Re: [petsc-users] Running CG with HYPRE AMG preconditioner in AMD GPUs
>
> You can run with -log_view_gpu_time to get rid of the nans and get more data.
>
> You can run with -ksp_view to get more info on the solver and send that output.
>
> -options_left is also good to use so we can see what parameters you used.
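As a rough illustration, the runtime options discussed in this thread could be combined on a single launch line like the sketch below (the launcher, rank count, and executable name are placeholders, not taken from the thread):

  $ srun -n 8 ./my_solver \
      -vec_type hip -mat_type aijhipsparse \
      -log_view -log_view_gpu_time -ksp_view -options_left

To compare hypre against PETSc's native AMG on the same run, -pc_type gamg can be added, assuming the application calls KSPSetFromOptions so the option is picked up.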
>
> The last 100 in this row:
>
> KSPSolve    1197 0.0 2.0291e+02 0.0 2.55e+11 0.0 3.9e+04 8.0e+04 3.1e+04 12 100 100 100 49  12 100 100 100 98  2503    -nan    0 1.80e-05    0 0.00e+00 100
>
> tells us that all the flops were logged on GPUs.
>
> You do need at least 100K equations per GPU to see speedup, so don't worry
> about small problems.
>
> Mark
>
>
> On Tue, Mar 5, 2024 at 12:52 PM Vanella, Marcos (Fed) via petsc-users <[email protected]> wrote:
>
> Hi all, I compiled the latest PETSc source on Frontier using gcc+kokkos
> and hip options:
>
> ./configure COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3"
> FCOPTFLAGS="-O3" HIPOPTFLAGS="-O3" --with-debugging=0 --with-cc=cc
> --with-cxx=CC --with-fc=ftn --with-hip --with-hipc=hipcc
> --LIBS="-L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a}
> ${PE_MPICH_GTL_LIBS_amd_gfx90a}" --download-kokkos
> --download-kokkos-kernels --download-suitesparse --download-hypre
> --download-cmake
>
> and have started testing our code solving a Poisson linear system with CG
> + HYPRE preconditioner. Timings look rather high compared to builds done
> on other machines that have NVIDIA cards. They also do not change when
> using more than one GPU for the simple test I am doing.
> Does anyone happen to know whether HYPRE has a HIP GPU implementation of
> BoomerAMG, and whether it is compiled when configuring PETSc?
> [a quick check is sketched after the log below]
>
> Thanks!
>
> Marcos
>
>
> PS: This is what I see in the log file (-log_view) when running the case
> with 2 GPUs in the node:
>
>
> ------------------------------------------------------------------ PETSc Performance Summary: ------------------------------------------------------------------
>
> /ccs/home/vanellam/Firemodels_fork/fds/Build/mpich_gnu_frontier/fds_mpich_gnu_frontier on a arch-linux-frontier-opt-gcc named frontier04119 with 4 processors, by vanellam Tue Mar  5 12:42:29 2024
> Using Petsc Development GIT revision: v3.20.5-713-gabdf6bc0fcf  GIT Date: 2024-03-05 01:04:54 +0000
>
>                          Max       Max/Min     Avg       Total
> Time (sec):           8.368e+02     1.000     8.368e+02
> Objects:              0.000e+00     0.000     0.000e+00
> Flops:                2.546e+11     0.000     1.270e+11  5.079e+11
> Flops/sec:            3.043e+08     0.000     1.518e+08  6.070e+08
> MPI Msg Count:        1.950e+04     0.000     9.748e+03  3.899e+04
> MPI Msg Len (bytes):  1.560e+09     0.000     7.999e+04  3.119e+09
> MPI Reductions:       6.331e+04  2877.545
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flop ------   --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total     Count   %Total
>  0:      Main Stage: 8.3676e+02 100.0%  5.0792e+11 100.0%  3.899e+04 100.0%  7.999e+04     100.0%   3.164e+04  50.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flop: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    AvgLen: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flop in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>    CpuToGpu Count: total number of CPU to GPU copies per processor
>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>    GpuToCpu Count: total number of GPU to CPU copies per processor
>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>    GPU %F: percent flops on GPU in this event
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                            --- Global ---   --- Stage ----   Total    GPU    - CpuToGpu -   - GpuToCpu -  GPU
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct %T %F %M %L %R  %T %F %M %L %R  Mflop/s Mflop/s Count   Size   Count   Size   %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided       1201 0.0        nan nan 0.00e+00 0.0 2.0e+00 4.0e+00 6.0e+02  0  0  0  0  1   0  0  0  0  2   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> BuildTwoSidedF      1200 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+02  0  0  0  0  1   0  0  0  0  2   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> MatMult            19494 0.0        nan nan 1.35e+11 0.0 3.9e+04 8.0e+04 0.0e+00  7 53 100 100 0   7 53 100 100 0  -nan    -nan    0 1.80e-05    0 0.00e+00 100
> MatConvert             3 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> MatAssemblyBegin       2 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> MatAssemblyEnd         2 0.0        nan nan 0.00e+00 0.0 4.0e+00 2.0e+04 3.5e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> VecTDot            41382 0.0        nan nan 4.14e+10 0.0 0.0e+00 0.0e+00 2.1e+04  0 16  0  0 33   0 16  0  0 65   -nan    -nan    0 0.00e+00    0 0.00e+00 100
> VecNorm            20691 0.0        nan nan 2.07e+10 0.0 0.0e+00 0.0e+00 1.0e+04  0  8  0  0 16   0  8  0  0 33   -nan    -nan    0 0.00e+00    0 0.00e+00 100
> VecCopy             2394 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> VecSet             21888 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> VecAXPY            38988 0.0        nan nan 3.90e+10 0.0 0.0e+00 0.0e+00 0.0e+00  0 15  0  0  0   0 15  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00 100
> VecAYPX            18297 0.0        nan nan 1.83e+10 0.0 0.0e+00 0.0e+00 0.0e+00  0  7  0  0  0   0  7  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00 100
> VecAssemblyBegin    1197 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+02  0  0  0  0  1   0  0  0  0  2   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> VecAssemblyEnd      1197 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> VecScatterBegin    19494 0.0        nan nan 0.00e+00 0.0 3.9e+04 8.0e+04 0.0e+00  0  0 100 100 0   0  0 100 100 0  -nan    -nan    0 1.80e-05    0 0.00e+00   0
> VecScatterEnd      19494 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> SFSetGraph             1 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> SFSetUp                1 0.0        nan nan 0.00e+00 0.0 4.0e+00 2.0e+04 5.0e-01  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> SFPack             19494 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 1.80e-05    0 0.00e+00   0
> SFUnpack           19494 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> KSPSetUp               1 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> KSPSolve            1197 0.0 2.0291e+02 0.0 2.55e+11 0.0 3.9e+04 8.0e+04 3.1e+04 12 100 100 100 49  12 100 100 100 98  2503    -nan    0 1.80e-05    0 0.00e+00 100
> PCSetUp                1 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> PCApply            20691 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   5  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Object Type          Creations   Destructions. Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Matrix     7              3
>               Vector     7              1
>            Index Set     2              2
>    Star Forest Graph     1              0
>        Krylov Solver     1              0
>       Preconditioner     1              0
> ========================================================================================================================
> Average time to get PetscTime(): 3.01e-08
> Average time for MPI_Barrier(): 3.8054e-06
> Average time for zero size MPI_Send(): 7.101e-06
> #PETSc Option Table entries:
> -log_view # (source: command line)
> -mat_type mpiaijkokkos # (source: command line)
> -vec_type kokkos # (source: command line)
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 FCOPTFLAGS=-O3 HIPOPTFLAGS=-O3 --with-debugging=0 --with-cc=cc --with-cxx=CC --with-fc=ftn --with-hip --with-hipc=hipcc --LIBS="-L/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib -lmpi -L/opt/cray/pe/mpich/8.1.23/gtl/lib -lmpi_gtl_hsa" --download-kokkos --download-kokkos-kernels --download-suitesparse --download-hypre --download-cmake
> -----------------------------------------
> Libraries compiled on 2024-03-05 17:04:36 on login08
> Machine characteristics: Linux-5.14.21-150400.24.46_12.0.83-cray_shasta_c-x86_64-with-glibc2.3.4
> Using PETSc directory: /autofs/nccs-svm1_home1/vanellam/Software/petsc
> Using PETSc arch: arch-linux-frontier-opt-gcc
> -----------------------------------------
>
> Using C compiler: cc -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O3
> Using Fortran compiler: ftn -fPIC -Wall -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -O3
> -----------------------------------------
>
> Using include paths:
> -I/autofs/nccs-svm1_home1/vanellam/Software/petsc/include
> -I/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/include
> -I/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/include/suitesparse
> -I/opt/rocm-5.4.0/include
> -----------------------------------------
>
> Using C linker: cc
> Using Fortran linker: ftn
> Using libraries:
> -Wl,-rpath,/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
> -L/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
> -lpetsc
> -Wl,-rpath,/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
> -L/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
> -Wl,-rpath,/opt/rocm-5.4.0/lib -L/opt/rocm-5.4.0/lib
> -Wl,-rpath,/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib -L/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib
> -Wl,-rpath,/opt/cray/pe/mpich/8.1.23/gtl/lib -L/opt/cray/pe/mpich/8.1.23/gtl/lib
> -Wl,-rpath,/opt/cray/pe/libsci/22.12.1.1/GNU/9.1/x86_64/lib -L/opt/cray/pe/libsci/22.12.1.1/GNU/9.1/x86_64/lib
> -Wl,-rpath,/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/gcc-12.2.0/darshan-runtime-3.4.0-ftq5gccg3qjtyh5xeo2bz4wqkjayjhw3/lib
> -L/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/gcc-12.2.0/darshan-runtime-3.4.0-ftq5gccg3qjtyh5xeo2bz4wqkjayjhw3/lib
> -Wl,-rpath,/opt/cray/pe/dsmml/0.2.2/dsmml/lib -L/opt/cray/pe/dsmml/0.2.2/dsmml/lib
> -Wl,-rpath,/opt/cray/pe/pmi/6.1.8/lib -L/opt/cray/pe/pmi/6.1.8/lib
> -Wl,-rpath,/opt/cray/xpmem/2.6.2-2.5_2.22__gd067c3f.shasta/lib64 -L/opt/cray/xpmem/2.6.2-2.5_2.22__gd067c3f.shasta/lib64
> -Wl,-rpath,/opt/cray/pe/gcc/12.2.0/snos/lib/gcc/x86_64-suse-linux/12.2.0 -L/opt/cray/pe/gcc/12.2.0/snos/lib/gcc/x86_64-suse-linux/12.2.0
> -Wl,-rpath,/opt/cray/pe/gcc/12.2.0/snos/lib64 -L/opt/cray/pe/gcc/12.2.0/snos/lib64
> -Wl,-rpath,/opt/rocm-5.4.0/llvm/lib -L/opt/rocm-5.4.0/llvm/lib
> -Wl,-rpath,/opt/cray/pe/gcc/12.2.0/snos/lib -L/opt/cray/pe/gcc/12.2.0/snos/lib
> -lHYPRE -lspqr -lumfpack -lklu -lcholmod -lamd -lkokkoskernels -lkokkoscontainers -lkokkoscore -lkokkossimd
> -lhipsparse -lhipblas -lhipsolver -lrocsparse -lrocsolver -lrocblas -lrocrand -lamdhip64
> -lmpi -lmpi_gtl_hsa -ldarshan -lz -ldl -lxpmem -lgfortran -lm -lmpifort_gnu_91 -lmpi_gnu_91
> -lsci_gnu_82_mpi -lsci_gnu_82 -ldsmml -lpmi -lpmi2 -lgfortran -lquadmath -lpthread -lm -lgcc_s -lstdc++ -lquadmath -lmpi -lmpi_gtl_hsa
> -----------------------------------------
>
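On the question above of whether the hypre build picked up HIP support: one quick check is to look for the HIP macros in hypre's generated configuration header. This is a sketch under two assumptions not confirmed in the thread: that --download-hypre installs HYPRE_config.h under the PETSc arch include directory shown in the log, and that the macro names (HYPRE_USING_HIP, HYPRE_USING_GPU) match this hypre version.

  $ grep -E "HYPRE_USING_(HIP|GPU)" \
      $PETSC_DIR/arch-linux-frontier-opt-gcc/include/HYPRE_config.h

If those macros are defined there, hypre was built for the GPU; running with -ksp_view, as suggested earlier in the thread, will also print the BoomerAMG settings actually being used.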
