You want: -mat_type aijhipsparse

On Tue, Mar 19, 2024 at 5:06 PM Vanella, Marcos (Fed) <[email protected]> wrote:
> Hi Mark, thanks. I'll try your suggestions. So, I would keep -mat_type
> mpiaijkokkos but use -vec_type hip as runtime options?
> Thanks,
> Marcos
> ------------------------------
> From: Mark Adams <[email protected]>
> Sent: Tuesday, March 19, 2024 4:57 PM
> To: Vanella, Marcos (Fed) <[email protected]>
> Cc: PETSc users list <[email protected]>
> Subject: Re: [petsc-users] Running CG with HYPRE AMG preconditioner in AMD GPUs
>
> [keep on list]
>
> I have little experience with running hypre on GPUs but others might have more.
>
> 1M dofs/node is not a lot, and NVIDIA has a larger L1 cache and more mature
> compilers, etc., so it is not surprising that NVIDIA is faster.
> I suspect the gap would narrow with a larger problem.
>
> Also, why are you using Kokkos? It should not make a difference, but you
> could check easily. Just use -vec_type hip with your current code.
>
> You could also test with GAMG, -pc_type gamg
>
> Mark
>
>
> On Tue, Mar 19, 2024 at 4:12 PM Vanella, Marcos (Fed) <[email protected]> wrote:
>
> Hi Mark, I ran a canonical test we have to time our code. It is a propane
> fire on a burner within a box with around 1 million cells.
> I split the problem across 4 GPUs, single node, on both Polaris and Frontier.
> I compiled PETSc with the gnu compilers, downloading HYPRE, and the following
> configure options:
>
>   - Polaris:
>     $ ./configure COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3"
>       FCOPTFLAGS="-O3" CUDAOPTFLAGS="-O3" --with-debugging=0
>       --download-suitesparse --download-hypre --with-cuda --with-cc=cc
>       --with-cxx=CC --with-fc=ftn --with-cudac=nvcc --with-cuda-arch=80
>       --download-cmake
>
>   - Frontier:
>     $ ./configure COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3"
>       FCOPTFLAGS="-O3" HIPOPTFLAGS="-O3" --with-debugging=0 --with-cc=cc
>       --with-cxx=CC --with-fc=ftn --with-hip --with-hipc=hipcc
>       --LIBS="-L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a}
>       ${PE_MPICH_GTL_LIBS_amd_gfx90a}" --download-kokkos
>       --download-kokkos-kernels --download-suitesparse --download-hypre
>       --download-cmake
>
> Our code was also compiled with gnu compilers and the -O3 flag. I used the
> latest (from this week) PETSc repo update. These are the timings for the
> test case:
>
>   - 8 meshes + 1 million cells case, 8 MPI processes, 4 GPUs, 2 MPI
>     processes per GPU, 1 sec run time (~580 time steps, ~1160 Poisson solves):
>
> System     Poisson Solver   GPU Implementation   Poisson Wall time (sec)   Total Wall time (sec)
> Polaris    CG + HYPRE PC    CUDA                  80                       287
> Frontier   CG + HYPRE PC    Kokkos + HIP         158                       401
>
> It is interesting to see that the Poisson solves take twice as long on
> Frontier as on Polaris.
> Do you have experience running HYPRE AMG on these machines? Is this
> difference between the CUDA implementation and Kokkos-kernels to be expected?
>
> I can run the case on both computers with the log flags you suggest. That
> might give more information on where the differences are.
>
> Thank you for your time,
> Marcos
>
> ------------------------------
> From: Mark Adams <[email protected]>
> Sent: Tuesday, March 5, 2024 2:41 PM
> To: Vanella, Marcos (Fed) <[email protected]>
> Cc: [email protected] <[email protected]>
> Subject: Re: [petsc-users] Running CG with HYPRE AMG preconditioner in AMD GPUs
>
> You can run with -log_view_gpu_time to get rid of the nans and get more data.
>
> You can run with -ksp_view to get more info on the solver and send that output.
>
> -options_left is also good to use so we can see what parameters you used.
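As a rough illustration, the runtime options discussed in this thread could be combined on a single launch line like the sketch below (the launcher, rank count, and executable name are placeholders, not taken from the thread):

  $ srun -n 8 ./my_solver \
      -vec_type hip -mat_type aijhipsparse \
      -log_view -log_view_gpu_time -ksp_view -options_left

To compare hypre against PETSc's native AMG on the same run, -pc_type gamg can be added, assuming the application calls KSPSetFromOptions so the option is picked up.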
>
> The last 100 in this row:
>
> KSPSolve    1197 0.0 2.0291e+02 0.0 2.55e+11 0.0 3.9e+04 8.0e+04 3.1e+04 12 100 100 100 49  12 100 100 100 98  2503    -nan    0 1.80e-05    0 0.00e+00 100
>
> tells us that all the flops were logged on GPUs.
>
> You do need at least 100K equations per GPU to see speedup, so don't worry
> about small problems.
>
> Mark
>
>
> On Tue, Mar 5, 2024 at 12:52 PM Vanella, Marcos (Fed) via petsc-users <[email protected]> wrote:
>
> Hi all, I compiled the latest PETSc source on Frontier using gcc+kokkos
> and hip options:
>
> ./configure COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3"
> FCOPTFLAGS="-O3" HIPOPTFLAGS="-O3" --with-debugging=0 --with-cc=cc
> --with-cxx=CC --with-fc=ftn --with-hip --with-hipc=hipcc
> --LIBS="-L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a}
> ${PE_MPICH_GTL_LIBS_amd_gfx90a}" --download-kokkos
> --download-kokkos-kernels --download-suitesparse --download-hypre
> --download-cmake
>
> and have started testing our code solving a Poisson linear system with CG
> + HYPRE preconditioner. Timings look rather high compared to builds done
> on other machines that have NVIDIA cards. They also do not change when
> using more than one GPU for the simple test I am doing.
> Does anyone happen to know whether HYPRE has a HIP GPU implementation of
> BoomerAMG, and whether it is compiled when configuring PETSc?
> [a quick check is sketched after the log below]
>
> Thanks!
>
> Marcos
>
>
> PS: This is what I see in the log file (-log_view) when running the case
> with 2 GPUs in the node:
>
>
> ------------------------------------------------------------------ PETSc Performance Summary: ------------------------------------------------------------------
>
> /ccs/home/vanellam/Firemodels_fork/fds/Build/mpich_gnu_frontier/fds_mpich_gnu_frontier on a arch-linux-frontier-opt-gcc named frontier04119 with 4 processors, by vanellam Tue Mar  5 12:42:29 2024
> Using Petsc Development GIT revision: v3.20.5-713-gabdf6bc0fcf  GIT Date: 2024-03-05 01:04:54 +0000
>
>                          Max       Max/Min     Avg       Total
> Time (sec):           8.368e+02     1.000     8.368e+02
> Objects:              0.000e+00     0.000     0.000e+00
> Flops:                2.546e+11     0.000     1.270e+11  5.079e+11
> Flops/sec:            3.043e+08     0.000     1.518e+08  6.070e+08
> MPI Msg Count:        1.950e+04     0.000     9.748e+03  3.899e+04
> MPI Msg Len (bytes):  1.560e+09     0.000     7.999e+04  3.119e+09
> MPI Reductions:       6.331e+04  2877.545
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flop ------   --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total     Count   %Total
>  0:      Main Stage: 8.3676e+02 100.0%  5.0792e+11 100.0%  3.899e+04 100.0%  7.999e+04     100.0%   3.164e+04  50.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flop: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    AvgLen: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flop in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>    CpuToGpu Count: total number of CPU to GPU copies per processor
>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>    GpuToCpu Count: total number of GPU to CPU copies per processor
>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>    GPU %F: percent flops on GPU in this event
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                            --- Global ---   --- Stage ----   Total    GPU    - CpuToGpu -   - GpuToCpu -  GPU
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct %T %F %M %L %R  %T %F %M %L %R  Mflop/s Mflop/s Count   Size   Count   Size   %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided       1201 0.0        nan nan 0.00e+00 0.0 2.0e+00 4.0e+00 6.0e+02  0  0  0  0  1   0  0  0  0  2   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> BuildTwoSidedF      1200 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+02  0  0  0  0  1   0  0  0  0  2   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> MatMult            19494 0.0        nan nan 1.35e+11 0.0 3.9e+04 8.0e+04 0.0e+00  7 53 100 100 0   7 53 100 100 0  -nan    -nan    0 1.80e-05    0 0.00e+00 100
> MatConvert             3 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> MatAssemblyBegin       2 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> MatAssemblyEnd         2 0.0        nan nan 0.00e+00 0.0 4.0e+00 2.0e+04 3.5e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> VecTDot            41382 0.0        nan nan 4.14e+10 0.0 0.0e+00 0.0e+00 2.1e+04  0 16  0  0 33   0 16  0  0 65   -nan    -nan    0 0.00e+00    0 0.00e+00 100
> VecNorm            20691 0.0        nan nan 2.07e+10 0.0 0.0e+00 0.0e+00 1.0e+04  0  8  0  0 16   0  8  0  0 33   -nan    -nan    0 0.00e+00    0 0.00e+00 100
> VecCopy             2394 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> VecSet             21888 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> VecAXPY            38988 0.0        nan nan 3.90e+10 0.0 0.0e+00 0.0e+00 0.0e+00  0 15  0  0  0   0 15  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00 100
> VecAYPX            18297 0.0        nan nan 1.83e+10 0.0 0.0e+00 0.0e+00 0.0e+00  0  7  0  0  0   0  7  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00 100
> VecAssemblyBegin    1197 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+02  0  0  0  0  1   0  0  0  0  2   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> VecAssemblyEnd      1197 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> VecScatterBegin    19494 0.0        nan nan 0.00e+00 0.0 3.9e+04 8.0e+04 0.0e+00  0  0 100 100 0   0  0 100 100 0  -nan    -nan    0 1.80e-05    0 0.00e+00   0
> VecScatterEnd      19494 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> SFSetGraph             1 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> SFSetUp                1 0.0        nan nan 0.00e+00 0.0 4.0e+00 2.0e+04 5.0e-01  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> SFPack             19494 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 1.80e-05    0 0.00e+00   0
> SFUnpack           19494 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> KSPSetUp               1 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> KSPSolve            1197 0.0 2.0291e+02 0.0 2.55e+11 0.0 3.9e+04 8.0e+04 3.1e+04 12 100 100 100 49  12 100 100 100 98  2503    -nan    0 1.80e-05    0 0.00e+00 100
> PCSetUp                1 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+00  0  0  0  0  0   0  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> PCApply            20691 0.0        nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   5  0  0  0  0   -nan    -nan    0 0.00e+00    0 0.00e+00   0
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Object Type          Creations   Destructions. Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Matrix     7              3
>               Vector     7              1
>            Index Set     2              2
>    Star Forest Graph     1              0
>        Krylov Solver     1              0
>       Preconditioner     1              0
> ========================================================================================================================
> Average time to get PetscTime(): 3.01e-08
> Average time for MPI_Barrier(): 3.8054e-06
> Average time for zero size MPI_Send(): 7.101e-06
> #PETSc Option Table entries:
> -log_view # (source: command line)
> -mat_type mpiaijkokkos # (source: command line)
> -vec_type kokkos # (source: command line)
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 FCOPTFLAGS=-O3 HIPOPTFLAGS=-O3 --with-debugging=0 --with-cc=cc --with-cxx=CC --with-fc=ftn --with-hip --with-hipc=hipcc --LIBS="-L/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib -lmpi -L/opt/cray/pe/mpich/8.1.23/gtl/lib -lmpi_gtl_hsa" --download-kokkos --download-kokkos-kernels --download-suitesparse --download-hypre --download-cmake
> -----------------------------------------
> Libraries compiled on 2024-03-05 17:04:36 on login08
> Machine characteristics: Linux-5.14.21-150400.24.46_12.0.83-cray_shasta_c-x86_64-with-glibc2.3.4
> Using PETSc directory: /autofs/nccs-svm1_home1/vanellam/Software/petsc
> Using PETSc arch: arch-linux-frontier-opt-gcc
> -----------------------------------------
>
> Using C compiler: cc -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O3
> Using Fortran compiler: ftn -fPIC -Wall -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -O3
> -----------------------------------------
>
> Using include paths:
> -I/autofs/nccs-svm1_home1/vanellam/Software/petsc/include
> -I/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/include
> -I/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/include/suitesparse
> -I/opt/rocm-5.4.0/include
> -----------------------------------------
>
> Using C linker: cc
> Using Fortran linker: ftn
> Using libraries:
> -Wl,-rpath,/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
> -L/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
> -lpetsc
> -Wl,-rpath,/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
> -L/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
> -Wl,-rpath,/opt/rocm-5.4.0/lib -L/opt/rocm-5.4.0/lib
> -Wl,-rpath,/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib -L/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib
> -Wl,-rpath,/opt/cray/pe/mpich/8.1.23/gtl/lib -L/opt/cray/pe/mpich/8.1.23/gtl/lib
> -Wl,-rpath,/opt/cray/pe/libsci/22.12.1.1/GNU/9.1/x86_64/lib -L/opt/cray/pe/libsci/22.12.1.1/GNU/9.1/x86_64/lib
> -Wl,-rpath,/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/gcc-12.2.0/darshan-runtime-3.4.0-ftq5gccg3qjtyh5xeo2bz4wqkjayjhw3/lib
> -L/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/gcc-12.2.0/darshan-runtime-3.4.0-ftq5gccg3qjtyh5xeo2bz4wqkjayjhw3/lib
> -Wl,-rpath,/opt/cray/pe/dsmml/0.2.2/dsmml/lib -L/opt/cray/pe/dsmml/0.2.2/dsmml/lib
> -Wl,-rpath,/opt/cray/pe/pmi/6.1.8/lib -L/opt/cray/pe/pmi/6.1.8/lib
> -Wl,-rpath,/opt/cray/xpmem/2.6.2-2.5_2.22__gd067c3f.shasta/lib64 -L/opt/cray/xpmem/2.6.2-2.5_2.22__gd067c3f.shasta/lib64
> -Wl,-rpath,/opt/cray/pe/gcc/12.2.0/snos/lib/gcc/x86_64-suse-linux/12.2.0 -L/opt/cray/pe/gcc/12.2.0/snos/lib/gcc/x86_64-suse-linux/12.2.0
> -Wl,-rpath,/opt/cray/pe/gcc/12.2.0/snos/lib64 -L/opt/cray/pe/gcc/12.2.0/snos/lib64
> -Wl,-rpath,/opt/rocm-5.4.0/llvm/lib -L/opt/rocm-5.4.0/llvm/lib
> -Wl,-rpath,/opt/cray/pe/gcc/12.2.0/snos/lib -L/opt/cray/pe/gcc/12.2.0/snos/lib
> -lHYPRE -lspqr -lumfpack -lklu -lcholmod -lamd -lkokkoskernels -lkokkoscontainers -lkokkoscore -lkokkossimd
> -lhipsparse -lhipblas -lhipsolver -lrocsparse -lrocsolver -lrocblas -lrocrand -lamdhip64
> -lmpi -lmpi_gtl_hsa -ldarshan -lz -ldl -lxpmem -lgfortran -lm -lmpifort_gnu_91 -lmpi_gnu_91
> -lsci_gnu_82_mpi -lsci_gnu_82 -ldsmml -lpmi -lpmi2 -lgfortran -lquadmath -lpthread -lm -lgcc_s -lstdc++ -lquadmath -lmpi -lmpi_gtl_hsa
> -----------------------------------------
>
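On the question above of whether the hypre build picked up HIP support: one quick check is to look for the HIP macros in hypre's generated configuration header. This is a sketch under two assumptions not confirmed in the thread: that --download-hypre installs HYPRE_config.h under the PETSc arch include directory shown in the log, and that the macro names (HYPRE_USING_HIP, HYPRE_USING_GPU) match this hypre version.

  $ grep -E "HYPRE_USING_(HIP|GPU)" \
      $PETSC_DIR/arch-linux-frontier-opt-gcc/include/HYPRE_config.h

If those macros are defined there, hypre was built for the GPU; running with -ksp_view, as suggested earlier in the thread, will also print the BoomerAMG settings actually being used.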
