Re: [petsc-users] Running CG with HYPRE AMG preconditioner in AMD GPUs

Vanella, Marcos (Fed) via petsc-users Tue, 19 Mar 2024 14:07:08 -0700

Hi Mark, thanks. I'll try your suggestions. So I would keep -mat_type 
mpiaijkokkos but switch to -vec_type hip as runtime options?
Thanks,
Marcos
________________________________
From: Mark Adams <[email protected]>
Sent: Tuesday, March 19, 2024 4:57 PM
To: Vanella, Marcos (Fed) <[email protected]>
Cc: PETSc users list <[email protected]>
Subject: Re: [petsc-users] Running CG with HYPRE AMG preconditioner in AMD GPUs


[keep on list]

I have little experience with running hypre on GPUs but others might have more.

1M dofs/node is not a lot, and NVIDIA has a larger L1 cache, more mature 
compilers, etc., so it is not surprising that NVIDIA is faster.
I suspect the gap would narrow with a larger problem.
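
As a rough check of the problem size per device, here is a minimal Python sketch. It assumes one unknown per cell and a perfectly even split across GPUs and MPI ranks, neither of which is stated in the thread. The 100K figure is the rule of thumb quoted further down in this thread.

```python
# Rough per-device problem size for the test case discussed in this thread.
# Assumptions (not confirmed by the thread): one Poisson unknown per cell
# and a perfectly even partition across GPUs and MPI ranks.
total_dofs = 1_000_000      # "around 1 million cells"
gpus = 4                    # single node, 4 GPUs
ranks = 8                   # 2 MPI processes per GPU
rule_of_thumb = 100_000     # "at least 100K equations per GPU"

dofs_per_gpu = total_dofs // gpus
dofs_per_rank = total_dofs // ranks
headroom = dofs_per_gpu / rule_of_thumb

print(f"dofs per GPU : {dofs_per_gpu}")
print(f"dofs per rank: {dofs_per_rank}")
print(f"multiple of the 100K-per-GPU rule of thumb: {headroom:.1f}x")
```

Under these assumptions each GPU holds about 250K unknowns, which is above the rule-of-thumb threshold but not by a wide margin.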

Also, why are you using Kokkos? It should not make a difference but you could 
check easily. Just use -vec_type hip with your current code.

You could also test with GAMG, -pc_type gamg

Mark


On Tue, Mar 19, 2024 at 4:12 PM Vanella, Marcos (Fed) 
<[email protected]> wrote:
Hi Mark, I ran a canonical test we use to time our code. It is a propane fire 
on a burner within a box with around 1 million cells.
I split the problem across 4 GPUs on a single node, on both Polaris and 
Frontier. I compiled PETSc with the GNU compilers, with HYPRE downloaded by 
configure, using the following configure options:


  * Polaris:
$./configure COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3" FCOPTFLAGS="-O3" 
CUDAOPTFLAGS="-O3" --with-debugging=0 --download-suitesparse --download-hypre 
--with-cuda --with-cc=cc --with-cxx=CC --with-fc=ftn --with-cudac=nvcc 
--with-cuda-arch=80 --download-cmake


  * Frontier:
$./configure COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3" FCOPTFLAGS="-O3" 
HIPOPTFLAGS="-O3" --with-debugging=0 --with-cc=cc --with-cxx=CC --with-fc=ftn 
--with-hip --with-hipc=hipcc --LIBS="-L${MPICH_DIR}/lib -lmpi 
${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}" 
--download-kokkos --download-kokkos-kernels --download-suitesparse 
--download-hypre --download-cmake

Our code was also compiled with the GNU compilers and the -O3 flag. I used the 
latest PETSc repo update (from this week). These are the timings for the test 
case:


  * 8 meshes + 1 million cells case, 8 MPI processes, 4 GPUs, 2 MPI procs per 
GPU, 1 sec run time (~580 time steps, ~1160 Poisson solves):

System     Poisson Solver   GPU Implementation   Poisson Wall time (sec)   Total Wall time (sec)
Polaris    CG + HYPRE PC    CUDA                 80                        287
Frontier   CG + HYPRE PC    Kokkos + HIP         158                       401

It is interesting to see that the Poisson solves take about twice as long on 
Frontier as on Polaris.
Do you have experience running HYPRE AMG on these machines? Is this 
difference between the CUDA implementation and Kokkos Kernels to be expected?

I can run the case on both machines with the log flags you suggested. That 
might give more information on where the differences are.
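
To put the table above in numbers, here is a small Python sketch that computes the Poisson share of the total wall time on each machine and the Frontier/Polaris ratio. The values are copied from the table; the variable and function names are only illustrative.

```python
# Wall times (seconds) copied from the timing table above.
timings = {
    "Polaris":  {"poisson": 80.0,  "total": 287.0},
    "Frontier": {"poisson": 158.0, "total": 401.0},
}

def summarize(t):
    """Return the Poisson share of the total and the remaining (non-Poisson) time."""
    return t["poisson"] / t["total"], t["total"] - t["poisson"]

for name, t in timings.items():
    share, rest = summarize(t)
    print(f"{name:9s} Poisson share {share:5.1%}  non-Poisson time {rest:5.0f} s")

poisson_ratio = timings["Frontier"]["poisson"] / timings["Polaris"]["poisson"]
print(f"Frontier/Polaris Poisson time ratio: {poisson_ratio:.2f}")
```

The Poisson ratio comes out just under 2. The non-Poisson remainder is 207 s on Polaris and 243 s on Frontier, so about two thirds of the 114 s gap in total wall time comes from the Poisson solves (78 s).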

Thank you for your time,
Marcos


________________________________
From: Mark Adams <[email protected]>
Sent: Tuesday, March 5, 2024 2:41 PM
To: Vanella, Marcos (Fed) <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [petsc-users] Running CG with HYPRE AMG preconditioner in AMD GPUs

You can run with -log_view_gpu_time to get rid of the nans and get more data.

You can run with -ksp_view to get more info on the solver and send that output.

-options_left is also good to use so we can see what parameters you used.

The last 100 in this row:

KSPSolve            1197 0.0 2.0291e+02 0.0 2.55e+11 0.0 3.9e+04 8.0e+04 
3.1e+04 12 100 100 100 49  12 100 100 100 98  2503    -nan      0 1.80e-05    0 
0.00e+00  100

tells us that all the flops were logged on GPUs.
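
For anyone who wants to pull that last column out of a log automatically, here is a minimal Python sketch. It assumes the row's columns are separated by whitespace, which holds for this row but may not hold for every log, since wide values can run adjacent columns together. The function name is made up for illustration and is not part of PETSc.

```python
# The KSPSolve row as it appears in the log, re-joined from its wrapped lines.
row = (
    "KSPSolve            1197 0.0 2.0291e+02 0.0 2.55e+11 0.0 3.9e+04 8.0e+04 "
    "3.1e+04 12 100 100 100 49  12 100 100 100 98  2503    -nan      0 1.80e-05    0 "
    "0.00e+00  100"
)

def parse_event_row(line):
    """Split a -log_view event row on whitespace and pick out a few columns.

    Hypothetical helper: it assumes whitespace-separated columns, which is
    true for this row but not guaranteed in general.
    """
    tokens = line.split()
    return {
        "event": tokens[0],
        "count": int(tokens[1]),
        "max_time_sec": float(tokens[3]),
        "gpu_flop_pct": int(tokens[-1]),  # last column: GPU %F
    }

info = parse_event_row(row)
print(info)
```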

You do need at least 100K equations per GPU to see speedup, so don't worry 
about small problems.

Mark




On Tue, Mar 5, 2024 at 12:52 PM Vanella, Marcos (Fed) via petsc-users 
<[email protected]> wrote:
Hi all, I compiled the latest PETSc source in Frontier using gcc+kokkos and hip 
options:

./configure COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3" FCOPTFLAGS="-O3" 
HIPOPTFLAGS="-O3" --with-debugging=0 --with-cc=cc --with-cxx=CC --with-fc=ftn 
--with-hip --with-hipc=hipcc --LIBS="-L${MPICH_DIR}/lib -lmpi 
${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}" 
--download-kokkos --download-kokkos-kernels --download-suitesparse 
--download-hypre --download-cmake

and have started testing our code solving a Poisson linear system with CG + 
HYPRE preconditioner. Timings look rather high compared to builds on other 
machines that have NVIDIA cards. They also do not change when using more than 
one GPU for the simple test I am doing.
Does anyone happen to know whether HYPRE has a HIP GPU implementation of 
BoomerAMG, and whether it is compiled when configuring PETSc?

Thanks!

Marcos


PS: This is what I see in the log file (-log_view) when running the case with 2 
GPUs on the node:


------------------------------------------------------------------ PETSc 
Performance Summary: 
------------------------------------------------------------------

/ccs/home/vanellam/Firemodels_fork/fds/Build/mpich_gnu_frontier/fds_mpich_gnu_frontier
 on a arch-linux-frontier-opt-gcc named frontier04119 with 4 processors, by 
vanellam Tue Mar  5 12:42:29 2024
Using Petsc Development GIT revision: v3.20.5-713-gabdf6bc0fcf  GIT Date: 
2024-03-05 01:04:54 +0000

                         Max       Max/Min     Avg       Total
Time (sec):           8.368e+02     1.000   8.368e+02
Objects:              0.000e+00     0.000   0.000e+00
Flops:                2.546e+11     0.000   1.270e+11  5.079e+11
Flops/sec:            3.043e+08     0.000   1.518e+08  6.070e+08
MPI Msg Count:        1.950e+04     0.000   9.748e+03  3.899e+04
MPI Msg Len (bytes):  1.560e+09     0.000   7.999e+04  3.119e+09
MPI Reductions:       6.331e+04   2877.545

Flop counting convention: 1 flop = 1 real number operation of type 
(multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N 
flops
                            and VecAXPY() for complex vectors of length N --> 
8N flops

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- 
Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     
Avg         %Total    Count   %Total
 0:      Main Stage: 8.3676e+02 100.0%  5.0792e+11 100.0%  3.899e+04 100.0%  
7.999e+04      100.0%  3.164e+04  50.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting 
output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and 
PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in 
this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all 
processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time 
over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per 
processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per 
processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                             
 --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct 
 %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided       1201 0.0   nan nan 0.00e+00 0.0 2.0e+00 4.0e+00 6.0e+02  0  
0  0  0  1   0  0  0  0  2  -nan    -nan      0 0.00e+00    0 0.00e+00  0
BuildTwoSidedF      1200 0.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+02  0  
0  0  0  1   0  0  0  0  2  -nan    -nan      0 0.00e+00    0 0.00e+00  0
MatMult            19494 0.0   nan nan 1.35e+11 0.0 3.9e+04 8.0e+04 0.0e+00  7 
53 100 100  0   7 53 100 100  0  -nan    -nan      0 1.80e-05    0 0.00e+00  100
MatConvert             3 0.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+00  0  
0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00  0
MatAssemblyBegin       2 0.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  
0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00  0
MatAssemblyEnd         2 0.0   nan nan 0.00e+00 0.0 4.0e+00 2.0e+04 3.5e+00  0  
0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00  0
VecTDot            41382 0.0   nan nan 4.14e+10 0.0 0.0e+00 0.0e+00 2.1e+04  0 
16  0  0 33   0 16  0  0 65  -nan    -nan      0 0.00e+00    0 0.00e+00  100
VecNorm            20691 0.0   nan nan 2.07e+10 0.0 0.0e+00 0.0e+00 1.0e+04  0  
8  0  0 16   0  8  0  0 33  -nan    -nan      0 0.00e+00    0 0.00e+00  100
VecCopy             2394 0.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  
0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00  0
VecSet             21888 0.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  
0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00  0
VecAXPY            38988 0.0   nan nan 3.90e+10 0.0 0.0e+00 0.0e+00 0.0e+00  0 
15  0  0  0   0 15  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00  100
VecAYPX            18297 0.0   nan nan 1.83e+10 0.0 0.0e+00 0.0e+00 0.0e+00  0  
7  0  0  0   0  7  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00  100
VecAssemblyBegin    1197 0.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+02  0  
0  0  0  1   0  0  0  0  2  -nan    -nan      0 0.00e+00    0 0.00e+00  0
VecAssemblyEnd      1197 0.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  
0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00  0
VecScatterBegin    19494 0.0   nan nan 0.00e+00 0.0 3.9e+04 8.0e+04 0.0e+00  0  
0 100 100  0   0  0 100 100  0  -nan    -nan      0 1.80e-05    0 0.00e+00  0
VecScatterEnd      19494 0.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  
0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00  0
SFSetGraph             1 0.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  
0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00  0
SFSetUp                1 0.0   nan nan 0.00e+00 0.0 4.0e+00 2.0e+04 5.0e-01  0  
0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00  0
SFPack             19494 0.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  
0  0  0  0   0  0  0  0  0  -nan    -nan      0 1.80e-05    0 0.00e+00  0
SFUnpack           19494 0.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  
0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00  0
KSPSetUp               1 0.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  
0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00  0
KSPSolve            1197 0.0 2.0291e+02 0.0 2.55e+11 0.0 3.9e+04 8.0e+04 
3.1e+04 12 100 100 100 49  12 100 100 100 98  2503    -nan      0 1.80e-05    0 
0.00e+00  100
PCSetUp                1 0.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+00  0  
0  0  0  0   0  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00  0
PCApply            20691 0.0   nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  
0  0  0  0   5  0  0  0  0  -nan    -nan      0 0.00e+00    0 0.00e+00  0
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Object Type          Creations   Destructions. Reports information only for 
process 0.

--- Event Stage 0: Main Stage

              Matrix     7              3
              Vector     7              1
           Index Set     2              2
   Star Forest Graph     1              0
       Krylov Solver     1              0
      Preconditioner     1              0
========================================================================================================================
Average time to get PetscTime(): 3.01e-08
Average time for MPI_Barrier(): 3.8054e-06
Average time for zero size MPI_Send(): 7.101e-06
#PETSc Option Table entries:
-log_view # (source: command line)
-mat_type mpiaijkokkos # (source: command line)
-vec_type kokkos # (source: command line)
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 
sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 FCOPTFLAGS=-O3 
HIPOPTFLAGS=-O3 --with-debugging=0 --with-cc=cc --with-cxx=CC --with-fc=ftn 
--with-hip --with-hipc=hipcc 
--LIBS="-L/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib -lmpi 
-L/opt/cray/pe/mpich/8.1.23/gtl/lib -lmpi_gtl_hsa" --download-kokkos 
--download-kokkos-kernels --download-suitesparse --download-hypre 
--download-cmake
-----------------------------------------
Libraries compiled on 2024-03-05 17:04:36 on login08
Machine characteristics: 
Linux-5.14.21-150400.24.46_12.0.83-cray_shasta_c-x86_64-with-glibc2.3.4
Using PETSc directory: /autofs/nccs-svm1_home1/vanellam/Software/petsc
Using PETSc arch: arch-linux-frontier-opt-gcc
-----------------------------------------

Using C compiler: cc  -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas 
-Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector 
-fvisibility=hidden -O3
Using Fortran compiler: ftn  -fPIC -Wall -ffree-line-length-none 
-ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -O3
-----------------------------------------

Using include paths: -I/autofs/nccs-svm1_home1/vanellam/Software/petsc/include 
-I/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/include
 
-I/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/include/suitesparse
 -I/opt/rocm-5.4.0/include
-----------------------------------------

Using C linker: cc
Using Fortran linker: ftn
Using libraries: 
-Wl,-rpath,/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
 
-L/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
 -lpetsc 
-Wl,-rpath,/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
 
-L/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
 -Wl,-rpath,/opt/rocm-5.4.0/lib -L/opt/rocm-5.4.0/lib 
-Wl,-rpath,/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib 
-L/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib 
-Wl,-rpath,/opt/cray/pe/mpich/8.1.23/gtl/lib 
-L/opt/cray/pe/mpich/8.1.23/gtl/lib 
-Wl,-rpath,/opt/cray/pe/libsci/22.12.1.1/GNU/9.1/x86_64/lib 
-L/opt/cray/pe/libsci/22.12.1.1/GNU/9.1/x86_64/lib 
-Wl,-rpath,/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/gcc-12.2.0/darshan-runtime-3.4.0-ftq5gccg3qjtyh5xeo2bz4wqkjayjhw3/lib
 
-L/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/gcc-12.2.0/darshan-runtime-3.4.0-ftq5gccg3qjtyh5xeo2bz4wqkjayjhw3/lib
 -Wl,-rpath,/opt/cray/pe/dsmml/0.2.2/dsmml/lib 
-L/opt/cray/pe/dsmml/0.2.2/dsmml/lib -Wl,-rpath,/opt/cray/pe/pmi/6.1.8/lib 
-L/opt/cray/pe/pmi/6.1.8/lib 
-Wl,-rpath,/opt/cray/xpmem/2.6.2-2.5_2.22__gd067c3f.shasta/lib64 
-L/opt/cray/xpmem/2.6.2-2.5_2.22__gd067c3f.shasta/lib64 
-Wl,-rpath,/opt/cray/pe/gcc/12.2.0/snos/lib/gcc/x86_64-suse-linux/12.2.0 
-L/opt/cray/pe/gcc/12.2.0/snos/lib/gcc/x86_64-suse-linux/12.2.0 
-Wl,-rpath,/opt/cray/pe/gcc/12.2.0/snos/lib64 
-L/opt/cray/pe/gcc/12.2.0/snos/lib64 -Wl,-rpath,/opt/rocm-5.4.0/llvm/lib 
-L/opt/rocm-5.4.0/llvm/lib -Wl,-rpath,/opt/cray/pe/gcc/12.2.0/snos/lib 
-L/opt/cray/pe/gcc/12.2.0/snos/lib -lHYPRE -lspqr -lumfpack -lklu -lcholmod 
-lamd -lkokkoskernels -lkokkoscontainers -lkokkoscore -lkokkossimd -lhipsparse 
-lhipblas -lhipsolver -lrocsparse -lrocsolver -lrocblas -lrocrand -lamdhip64 
-lmpi -lmpi_gtl_hsa -ldarshan -lz -ldl -lxpmem -lgfortran -lm -lmpifort_gnu_91 
-lmpi_gnu_91 -lsci_gnu_82_mpi -lsci_gnu_82 -ldsmml -lpmi -lpmi2 -lgfortran 
-lquadmath -lpthread -lm -lgcc_s -lstdc++ -lquadmath -lmpi -lmpi_gtl_hsa
-----------------------------------------
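
The "Total Mflop/s" definition in the log legend above can be checked against the KSPSolve row with a short Python sketch. The numbers are copied from the log. Two assumptions are made here: the legend's "10e-6" is read as 1e-6 (a flop/s to Mflop/s conversion), and the whole-run flop total is used as the KSPSolve flop count because that row reports %F = 100.

```python
# Values copied from the log above.
total_flop = 5.079e11   # "Flops", Total column of the summary
total_time = 8.368e02   # "Time (sec)", Max column
ksp_time = 2.0291e02    # KSPSolve max time (sec)

# KSPSolve accounts for 100% of the flop (%F = 100), so the whole-run total
# is used as its flop count. Reading the legend's "10e-6" as 1e-6 is an
# assumption made for this check.
ksp_mflops = 1e-6 * total_flop / ksp_time
run_flops_per_sec = total_flop / total_time

print(f"KSPSolve Total Mflop/s: {ksp_mflops:.0f}")
print(f"Whole-run flop/s: {run_flops_per_sec:.3e}")
```

With these inputs the sketch reproduces the 2503 Mflop/s reported in the KSPSolve row and the 6.070e+08 flop/s in the summary's Total column.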
