Thank you, Mark. I'll try the options you suggest to get more info. I'm also building PETSc and the code with the Cray compiler suite to test. The test I'm running has 1 million unknowns; I was able to see good scaling up to 4 GPUs for this case on Polaris.

Talk soon,
Marcos

________________________________
From: Mark Adams <[email protected]>
Sent: Tuesday, March 5, 2024 2:41 PM
To: Vanella, Marcos (Fed) <[email protected]>
Cc: [email protected]
Subject: Re: [petsc-users] Running CG with HYPRE AMG preconditioner in AMD GPUs
You can run with -log_view_gpu_time to get rid of the nans and get more data.

You can run with -ksp_view to get more info on the solver and send that output. -options_left is also good to use so we can see what parameters you used.

The last 100 in this row:

KSPSolve 1197 0.0 2.0291e+02 0.0 2.55e+11 0.0 3.9e+04 8.0e+04 3.1e+04 12 100 100 100 49 12 100 100 100 98 2503 -nan 0 1.80e-05 0 0.00e+00 100

tells us that all the flops were logged on GPUs. You do need at least 100K equations per GPU to see speedup, so don't worry about small problems.

Mark

On Tue, Mar 5, 2024 at 12:52 PM Vanella, Marcos (Fed) via petsc-users <[email protected]> wrote:

Hi all, I compiled the latest PETSc source on Frontier using gcc + Kokkos and HIP options:

./configure COPTFLAGS="-O3" CXXOPTFLAGS="-O3" FOPTFLAGS="-O3" FCOPTFLAGS="-O3" HIPOPTFLAGS="-O3" --with-debugging=0 --with-cc=cc --with-cxx=CC --with-fc=ftn --with-hip --with-hipc=hipcc --LIBS="-L${MPICH_DIR}/lib -lmpi ${PE_MPICH_GTL_DIR_amd_gfx90a} ${PE_MPICH_GTL_LIBS_amd_gfx90a}" --download-kokkos --download-kokkos-kernels --download-suitesparse --download-hypre --download-cmake

and have started testing our code, solving a Poisson linear system with CG + a HYPRE preconditioner. Timings look rather high compared to builds on other machines that have NVIDIA cards, and they also do not change when using more than one GPU for the simple test I'm doing. Does anyone happen to know if HYPRE has a HIP GPU implementation of BoomerAMG, and whether it is compiled when configuring PETSc?

Thanks!
Marcos

PS: This is what I see in the log file (-log_view) when running the case with 2 GPUs on the node:

------------------------------------------------------------------ PETSc Performance Summary: ------------------------------------------------------------------

/ccs/home/vanellam/Firemodels_fork/fds/Build/mpich_gnu_frontier/fds_mpich_gnu_frontier on a arch-linux-frontier-opt-gcc named frontier04119 with 4 processors, by vanellam Tue Mar 5 12:42:29 2024
Using Petsc Development GIT revision: v3.20.5-713-gabdf6bc0fcf  GIT Date: 2024-03-05 01:04:54 +0000

                         Max       Max/Min     Avg       Total
Time (sec):           8.368e+02     1.000   8.368e+02
Objects:              0.000e+00     0.000   0.000e+00
Flops:                2.546e+11     0.000   1.270e+11  5.079e+11
Flops/sec:            3.043e+08     0.000   1.518e+08  6.070e+08
MPI Msg Count:        1.950e+04     0.000   9.748e+03  3.899e+04
MPI Msg Len (bytes):  1.560e+09     0.000   7.999e+04  3.119e+09
MPI Reductions:       6.331e+04  2877.545

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flops
                            and VecAXPY() for complex vectors of length N --> 8N flops

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
 0:      Main Stage: 8.3676e+02 100.0%  5.0792e+11 100.0%  3.899e+04 100.0%  7.999e+04 100.0%  3.164e+04  50.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop        --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided       1201 0.0 nan nan 0.00e+00 0.0 2.0e+00 4.0e+00 6.0e+02  0  0  0  0  1   0  0  0  0  2 -nan -nan 0 0.00e+00 0 0.00e+00   0
BuildTwoSidedF      1200 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+02  0  0  0  0  1   0  0  0  0  2 -nan -nan 0 0.00e+00 0 0.00e+00   0
MatMult            19494 0.0 nan nan 1.35e+11 0.0 3.9e+04 8.0e+04 0.0e+00  7 53 100 100 0   7 53 100 100 0 -nan -nan 0 1.80e-05 0 0.00e+00 100
MatConvert             3 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+00  0  0  0  0  0   0  0  0  0  0 -nan -nan 0 0.00e+00 0 0.00e+00   0
MatAssemblyBegin       2 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0 -nan -nan 0 0.00e+00 0 0.00e+00   0
MatAssemblyEnd         2 0.0 nan nan 0.00e+00 0.0 4.0e+00 2.0e+04 3.5e+00  0  0  0  0  0   0  0  0  0  0 -nan -nan 0 0.00e+00 0 0.00e+00   0
VecTDot            41382 0.0 nan nan 4.14e+10 0.0 0.0e+00 0.0e+00 2.1e+04  0 16  0  0 33   0 16  0  0 65 -nan -nan 0 0.00e+00 0 0.00e+00 100
VecNorm            20691 0.0 nan nan 2.07e+10 0.0 0.0e+00 0.0e+00 1.0e+04  0  8  0  0 16   0  8  0  0 33 -nan -nan 0 0.00e+00 0 0.00e+00 100
VecCopy             2394 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 -nan -nan 0 0.00e+00 0 0.00e+00   0
VecSet             21888 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 -nan -nan 0 0.00e+00 0 0.00e+00   0
VecAXPY            38988 0.0 nan nan 3.90e+10 0.0 0.0e+00 0.0e+00 0.0e+00  0 15  0  0  0   0 15  0  0  0 -nan -nan 0 0.00e+00 0 0.00e+00 100
VecAYPX            18297 0.0 nan nan 1.83e+10 0.0 0.0e+00 0.0e+00 0.0e+00  0  7  0  0  0   0  7  0  0  0 -nan -nan 0 0.00e+00 0 0.00e+00 100
VecAssemblyBegin    1197 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+02  0  0  0  0  1   0  0  0  0  2 -nan -nan 0 0.00e+00 0 0.00e+00   0
VecAssemblyEnd      1197 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 -nan -nan 0 0.00e+00 0 0.00e+00   0
VecScatterBegin    19494 0.0 nan nan 0.00e+00 0.0 3.9e+04 8.0e+04 0.0e+00  0  0 100 100 0   0  0 100 100 0 -nan -nan 0 1.80e-05 0 0.00e+00   0
VecScatterEnd      19494 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 -nan -nan 0 0.00e+00 0 0.00e+00   0
SFSetGraph             1 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 -nan -nan 0 0.00e+00 0 0.00e+00   0
SFSetUp                1 0.0 nan nan 0.00e+00 0.0 4.0e+00 2.0e+04 5.0e-01  0  0  0  0  0   0  0  0  0  0 -nan -nan 0 0.00e+00 0 0.00e+00   0
SFPack             19494 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 -nan -nan 0 1.80e-05 0 0.00e+00   0
SFUnpack           19494 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 -nan -nan 0 0.00e+00 0 0.00e+00   0
KSPSetUp               1 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 -nan -nan 0 0.00e+00 0 0.00e+00   0
KSPSolve            1197 0.0 2.0291e+02 0.0 2.55e+11 0.0 3.9e+04 8.0e+04 3.1e+04 12 100 100 100 49  12 100 100 100 98 2503 -nan 0 1.80e-05 0 0.00e+00 100
PCSetUp                1 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 1.5e+00  0  0  0  0  0   0  0  0  0  0 -nan -nan 0 0.00e+00 0 0.00e+00   0
PCApply            20691 0.0 nan nan 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   5  0  0  0  0 -nan -nan 0 0.00e+00 0 0.00e+00   0
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Object Type          Creations   Destructions. Reports information only for process 0.

--- Event Stage 0: Main Stage

              Matrix     7              3
              Vector     7              1
           Index Set     2              2
   Star Forest Graph     1              0
       Krylov Solver     1              0
      Preconditioner     1              0
========================================================================================================================
Average time to get PetscTime(): 3.01e-08
Average time for MPI_Barrier(): 3.8054e-06
Average time for zero size MPI_Send(): 7.101e-06
#PETSc Option Table entries:
-log_view # (source: command line)
-mat_type mpiaijkokkos # (source: command line)
-vec_type kokkos # (source: command line)
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 FCOPTFLAGS=-O3 HIPOPTFLAGS=-O3 --with-debugging=0 --with-cc=cc --with-cxx=CC --with-fc=ftn --with-hip --with-hipc=hipcc --LIBS="-L/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib -lmpi -L/opt/cray/pe/mpich/8.1.23/gtl/lib -lmpi_gtl_hsa" --download-kokkos --download-kokkos-kernels --download-suitesparse --download-hypre --download-cmake
-----------------------------------------
Libraries compiled on 2024-03-05 17:04:36 on login08
Machine characteristics: Linux-5.14.21-150400.24.46_12.0.83-cray_shasta_c-x86_64-with-glibc2.3.4
Using PETSc directory: /autofs/nccs-svm1_home1/vanellam/Software/petsc
Using PETSc arch: arch-linux-frontier-opt-gcc
-----------------------------------------
Using C compiler: cc -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O3
Using Fortran compiler: ftn -fPIC -Wall -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -O3
-----------------------------------------
Using include paths: -I/autofs/nccs-svm1_home1/vanellam/Software/petsc/include -I/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/include -I/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/include/suitesparse -I/opt/rocm-5.4.0/include
-----------------------------------------
Using C linker: cc
Using Fortran linker: ftn
Using libraries: -Wl,-rpath,/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib
-L/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib -lpetsc -Wl,-rpath,/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib -L/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-frontier-opt-gcc/lib -Wl,-rpath,/opt/rocm-5.4.0/lib -L/opt/rocm-5.4.0/lib -Wl,-rpath,/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib -L/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib -Wl,-rpath,/opt/cray/pe/mpich/8.1.23/gtl/lib -L/opt/cray/pe/mpich/8.1.23/gtl/lib -Wl,-rpath,/opt/cray/pe/libsci/22.12.1.1/GNU/9.1/x86_64/lib -L/opt/cray/pe/libsci/22.12.1.1/GNU/9.1/x86_64/lib -Wl,-rpath,/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/gcc-12.2.0/darshan-runtime-3.4.0-ftq5gccg3qjtyh5xeo2bz4wqkjayjhw3/lib -L/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/gcc-12.2.0/darshan-runtime-3.4.0-ftq5gccg3qjtyh5xeo2bz4wqkjayjhw3/lib -Wl,-rpath,/opt/cray/pe/dsmml/0.2.2/dsmml/lib -L/opt/cray/pe/dsmml/0.2.2/dsmml/lib -Wl,-rpath,/opt/cray/pe/pmi/6.1.8/lib -L/opt/cray/pe/pmi/6.1.8/lib -Wl,-rpath,/opt/cray/xpmem/2.6.2-2.5_2.22__gd067c3f.shasta/lib64 -L/opt/cray/xpmem/2.6.2-2.5_2.22__gd067c3f.shasta/lib64 -Wl,-rpath,/opt/cray/pe/gcc/12.2.0/snos/lib/gcc/x86_64-suse-linux/12.2.0 -L/opt/cray/pe/gcc/12.2.0/snos/lib/gcc/x86_64-suse-linux/12.2.0 -Wl,-rpath,/opt/cray/pe/gcc/12.2.0/snos/lib64 -L/opt/cray/pe/gcc/12.2.0/snos/lib64 -Wl,-rpath,/opt/rocm-5.4.0/llvm/lib -L/opt/rocm-5.4.0/llvm/lib -Wl,-rpath,/opt/cray/pe/gcc/12.2.0/snos/lib -L/opt/cray/pe/gcc/12.2.0/snos/lib -lHYPRE -lspqr -lumfpack -lklu -lcholmod -lamd -lkokkoskernels -lkokkoscontainers -lkokkoscore -lkokkossimd -lhipsparse -lhipblas -lhipsolver -lrocsparse -lrocsolver -lrocblas -lrocrand -lamdhip64 -lmpi -lmpi_gtl_hsa -ldarshan -lz -ldl -lxpmem -lgfortran -lm -lmpifort_gnu_91 -lmpi_gnu_91 -lsci_gnu_82_mpi -lsci_gnu_82 -ldsmml -lpmi -lpmi2 -lgfortran -lquadmath -lpthread -lm -lgcc_s -lstdc++ -lquadmath -lmpi -lmpi_gtl_hsa
-----------------------------------------
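
For reference, here is a minimal sketch of a run line that combines the diagnostics suggested earlier in the thread. It assumes the fds_mpich_gnu_frontier executable from the log above, an srun launcher, a placeholder input file name, and that the application passes its command line to PetscInitialize() so the PETSc options are picked up; selecting the preconditioner with -pc_type hypre -pc_hypre_type boomeramg is shown only as one way to do it from the options database.

# Sketch only: srun launcher, 4 MPI ranks, and poisson_test.fds are assumptions.
srun -n 4 ./fds_mpich_gnu_frontier poisson_test.fds \
  -vec_type kokkos -mat_type mpiaijkokkos \
  -ksp_type cg -pc_type hypre -pc_hypre_type boomeramg \
  -ksp_view -options_left \
  -log_view -log_view_gpu_time

Here -ksp_view prints the solver and preconditioner that actually ran, and -options_left reports any option that was set but never consumed.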

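On the question of whether the --download-hypre build enabled BoomerAMG on the GPU, one possible check, assuming a standard source build that leaves configure.log in the PETSc directory reported in the log above, is to search that log for the HIP/ROCm arguments passed through to hypre's configure:

# Sketch only: configure.log path taken from the "Using PETSc directory" line above.
grep -i "hypre" /autofs/nccs-svm1_home1/vanellam/Software/petsc/configure.log | grep -i -E "hip|rocm"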