Hello,
I am trying to run ex45 (from the KSP tutorials) with hypre on GPUs. I have
attached the Python configuration file and the -log_view output from running
the following command:

mpirun -n 2 ./ex45 -log_view -da_grid_x 169 -da_grid_y 169 -da_grid_z 169
  -dm_mat_type mpiaijcusparse -dm_vec_type mpicuda -ksp_type gmres -pc_type hypre
  -pc_hypre_type boomeramg -ksp_gmres_restart 31
  -pc_hypre_boomeramg_strong_threshold 0.7 -ksp_monitor
The problem was solved and converged, but from the output file I suspect hypre
is not running on the GPUs, since the PCApply and DMCreateMat events do not
record any GPU Mflop/s. Other events such as KSPSolve and MatMult, however, do
run on the GPUs.
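As a quick check, this is a minimal way to pull those rows out of the attached
log (log.txt is only an assumed file name for the saved -log_view output; the
trailing columns of each row are GPU Mflop/s, the CpuToGpu/GpuToCpu copy counts
and sizes, and GPU %F, and both PCApply and DMCreateMat show 0 there):

# Filter the event rows of interest from a saved -log_view output.
# "log.txt" is an assumed file name, used only for illustration.
events = ("MatMult", "KSPSolve", "DMCreateMat", "PCApply")
with open("log.txt") as f:
    for line in f:
        if line.startswith(events):
            print(line.rstrip())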
Could you please let me know whether I need to add any extra flags to the
attached arch-ci-linux-cuda11-double-xx.py configure script to get hypre
running on the GPUs?
Thanks,
Karthik.
#!/usr/bin/python

import os

petsc_hash_pkgs = os.path.join(os.getenv('HOME'), 'petsc-hash-pkgs')

if __name__ == '__main__':
    import sys
    import os
    sys.path.insert(0, os.path.abspath('config'))
    import configure
    configure_options = [
        '--package-prefix-hash='+petsc_hash_pkgs,
        '--with-make-test-np=2',
        'COPTFLAGS=-g -O',
        'FOPTFLAGS=-g -O',
        'CXXOPTFLAGS=-g -O',
        '--with-blaslapack=1',
        '--download-hypre=1',
        '--with-cuda-dir=/apps/packages/cuda/10.1/',
        '--with-mpi-dir=/apps/packages/gcc/7.3.0/openmpi/3.1.2',
        # '--with-cuda-dir=/apps/packages/compilers/nvidia-hpcsdk/Linux_x86_64/20.7/cuda/11.0',
        # '--with-mpi-dir=/apps/packages/mpi/nvidia-hpcsdk/20.7/openmpi/3.1.5',
    ]
    configure.petsc_configure(configure_options)
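For reference, this is roughly where any extra options would go in the option
list above. The only change sketched here is --with-debugging=0 (the -log_view
output below warns that this build was configured with debugging); I have not
guessed at any hypre/GPU-specific flag, since which one is needed, if any, is
exactly what I am asking about.

    # Sketch of the same option list with an optimized (non-debug) build.
    # No hypre/GPU-specific flag is added here; that is the open question.
    configure_options = [
        '--package-prefix-hash='+petsc_hash_pkgs,
        '--with-make-test-np=2',
        '--with-debugging=0',   # as recommended by the -log_view warning
        'COPTFLAGS=-g -O',
        'FOPTFLAGS=-g -O',
        'CXXOPTFLAGS=-g -O',
        '--with-blaslapack=1',
        '--download-hypre=1',
        '--with-cuda-dir=/apps/packages/cuda/10.1/',
        '--with-mpi-dir=/apps/packages/gcc/7.3.0/openmpi/3.1.2',
    ]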
0 KSP Residual norm 1.658481321480e+03
1 KSP Residual norm 3.270999311989e+02
2 KSP Residual norm 3.129531485499e+01
3 KSP Residual norm 2.351754477084e+00
4 KSP Residual norm 1.898053977239e-01
5 KSP Residual norm 1.611209883991e-02
Residual norm 0.000135673
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document             ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option. #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
./ex45 on a named glados.dl.ac.uk with 2 processors, by kchockalingam Fri Oct  8 11:37:51 2021
Using Petsc Release Version 3.15.3, Aug 06, 2021
                         Max       Max/Min     Avg       Total
Time (sec):           4.016e+01     1.000   4.016e+01
Objects:              4.600e+01     1.000   4.600e+01
Flop:                 4.165e+08     1.012   4.141e+08  8.282e+08
Flop/sec:             1.037e+07     1.012   1.031e+07  2.062e+07
Memory:               3.595e+08     1.011   3.576e+08  7.151e+08
MPI Messages:         8.000e+00     1.000   8.000e+00  1.600e+01
MPI Message Lengths:  1.485e+06     1.000   1.856e+05  2.970e+06
MPI Reductions:       4.720e+02     1.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flop
                            and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total     Count   %Total
 0:      Main Stage: 4.0163e+01 100.0%  8.2816e+08 100.0%  1.600e+01 100.0%  1.856e+05     100.0%   4.530e+02  96.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase          %F - percent flop in this phase
      %M - percent messages in this phase      %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option. #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided           4 1.0 5.4248e-05 1.1 0.00e+00 0.0 2.0e+00 4.0e+00 8.0e+00  0  0 12  0  2   0  0 12  0  2     0       0      0 0.00e+00    0 0.00e+00  0
BuildTwoSidedF          3 1.0 6.4760e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  1   0  0  0  0  1     0       0      0 0.00e+00    0 0.00e+00  0
MatMult                 6 1.0 1.0370e-01 1.0 1.88e+08 1.0 1.6e+01 1.9e+05 2.0e+00  0 45100100  0   0 45100100  0  3611  112547      2 2.12e+02    0 0.00e+00 100
MatConvert              1 1.0 2.5040e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.2e+01  6  0  0  0 11   6  0  0  0 11     0       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyBegin        3 1.0 7.6855e-02308.5 0.00e+00 0.0 0.0e+00 0.0e+00 1.2e+01  0  0  0  0  3   0  0  0  0  3     0       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyEnd          3 1.0 1.2646e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.5e+01  3  0  0  0 10   3  0  0  0 10     0       0      0 0.00e+00    0 0.00e+00  0
MatCUSPARSCopyTo        2 1.0 3.7301e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      2 2.12e+02    0 0.00e+00  0
KSPSetUp                1 1.0 3.5023e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+01  0  0  0  0  8   0  0  0  0  9     0       0      0 0.00e+00    0 0.00e+00  0
KSPSolve                1 1.0 6.2424e+00 1.0 3.75e+08 1.0 1.4e+01 1.8e+05 1.9e+02 16 90 88 85 40  16 90 88 85 42   120   54599     21 3.27e+02   16 9.65e+01 100
KSPGMRESOrthog          5 1.0 2.7125e-02 1.4 1.46e+08 1.0 0.0e+00 0.0e+00 4.5e+01  0 35  0  0 10   0 35  0  0 10 10677   95449     10 9.65e+01    5 1.33e-02 100
DMCreateMat             1 1.0 2.3494e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.5e+01  6  0  0  0 14   6  0  0  0 14     0       0      0 0.00e+00    0 0.00e+00  0
SFSetGraph              2 1.0 2.4669e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
SFSetUp                 1 1.0 6.8134e-04 1.0 0.00e+00 0.0 4.0e+00 5.7e+04 2.0e+00  0  0 25  8  0   0  0 25  8  0     0       0      0 0.00e+00    0 0.00e+00  0
SFPack                  6 1.0 4.9323e-06 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
SFUnpack                6 1.0 3.7476e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecMDot                 5 1.0 1.7741e-02 1.0 7.28e+07 1.0 0.0e+00 0.0e+00 1.0e+01  0 17  0  0  2   0 17  0  0  2  8162   95721      5 9.65e+01    5 1.33e-02 100
VecNorm                 7 1.0 4.1496e-03 1.5 3.40e+07 1.0 0.0e+00 0.0e+00 1.4e+01  0  8  0  0  3   0  8  0  0  3 16285   80156      1 1.93e+01    7 5.60e-05 100
VecScale                6 1.0 4.4610e-04 1.0 1.46e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0 64920   66623      6 4.80e-05    0 0.00e+00 100
VecCopy                 1 1.0 2.4004e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecSet                 27 1.0 6.9875e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecAXPY                 2 1.0 3.3899e-03 1.0 9.71e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  5696   87237      3 1.93e+01    0 0.00e+00 100
VecMAXPY                6 1.0 2.0297e-03 1.0 9.71e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0 23  0  0  0   0 23  0  0  0 95122   95527      6 1.60e-04    0 0.00e+00 100
VecScatterBegin         6 1.0 5.7104e-02 1.0 0.00e+00 0.0 1.6e+01 1.9e+05 2.0e+00  0  0100100  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd           6 1.0 6.1680e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecNormalize            6 1.0 4.4475e-03 1.5 4.37e+07 1.0 0.0e+00 0.0e+00 1.2e+01  0 10  0  0  3   0 10  0  0  3 19535   75736      7 1.93e+01    6 4.80e-05 100
VecCUDACopyTo           7 1.0 2.0760e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      7 1.35e+02    0 0.00e+00  0
VecCUDACopyFrom         5 1.0 1.4753e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    5 9.65e+01  0
PCSetUp                 1 1.0 3.0190e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 5.6e+01 75  0  0  0 12  75  0  0  0 12     0       0      0 0.00e+00    0 0.00e+00  0
PCApply                 6 1.0 6.0335e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 15  0  0  0  1  15  0  0  0  1     0       0      0 0.00e+00    5 9.65e+01  0
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Krylov Solver 1 1 19916 0.
DMKSP interface 1 1 664 0.
Matrix 5 5 300130304 0.
Distributed Mesh 1 1 5560 0.
Index Set 4 4 9942836 0.
IS L to G Mapping 1 1 9825664 0.
Star Forest Graph 4 4 4896 0.
Discrete System 1 1 904 0.
Weak Form 1 1 824 0.
Vector 24 24 349857688 0.
Preconditioner 1 1 1496 0.
Viewer 2 1 848 0.
========================================================================================================================
Average time to get PetscTime(): 2.6077e-08
Average time for MPI_Barrier(): 9.49204e-07
Average time for zero size MPI_Send(): 5.80028e-06
#PETSc Option Table entries:
-da_grid_x 169
-da_grid_y 169
-da_grid_z 169
-dm_mat_type mpiaijcusparse
-dm_vec_type mpicuda
-ksp_gmres_restart 31
-ksp_monitor
-ksp_type gmres
-log_view
-pc_hypre_boomeramg_strong_threshold 0.7
-pc_hypre_type boomeramg
-pc_type hypre
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --package-prefix-hash=/home/kchockalingam/petsc-hash-pkgs
--with-make-test-np=2 COPTFLAGS="-g -O" FOPTFLAGS="-g -O" CXXOPTFLAGS="-g -O"
--with-blaslapack=1 --download-hypre=1
--with-cuda-dir=/apps/packages/cuda/10.1/
--with-mpi-dir=/apps/packages/gcc/7.3.0/openmpi/3.1.2
PETSC_ARCH=arch-ci-linux-cuda11-double
-----------------------------------------
Libraries compiled on 2021-10-05 14:38:14 on glados.dl.ac.uk
Machine characteristics: Linux-4.18.0-193.6.3.el8_2.x86_64-x86_64-with-centos-8.2.2004-Core
Using PETSc directory: /home/kchockalingam/tools/petsc-3.15.3
Using PETSc arch:
-----------------------------------------
Using C compiler: /apps/packages/gcc/7.3.0/openmpi/3.1.2/bin/mpicc -fPIC -Wall
-Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector
-fvisibility=hidden -g -O
Using Fortran compiler: /apps/packages/gcc/7.3.0/openmpi/3.1.2/bin/mpif90
-fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O
-----------------------------------------
Using include paths: -I/home/kchockalingam/tools/petsc-3.15.3/include
-I/home/kchockalingam/tools/petsc-3.15.3/arch-ci-linux-cuda11-double/include
-I/home/kchockalingam/petsc-hash-pkgs/d71384/include
-I/apps/packages/gcc/7.3.0/openmpi/3.1.2/include
-I/apps/packages/cuda/10.1/include
-----------------------------------------
Using C linker: /apps/packages/gcc/7.3.0/openmpi/3.1.2/bin/mpicc
Using Fortran linker: /apps/packages/gcc/7.3.0/openmpi/3.1.2/bin/mpif90
Using libraries: -Wl,-rpath,/home/kchockalingam/tools/petsc-3.15.3/lib
-L/home/kchockalingam/tools/petsc-3.15.3/lib -lpetsc
-Wl,-rpath,/home/kchockalingam/petsc-hash-pkgs/d71384/lib
-L/home/kchockalingam/petsc-hash-pkgs/d71384/lib
-Wl,-rpath,/apps/packages/cuda/10.1/lib64 -L/apps/packages/cuda/10.1/lib64
-Wl,-rpath,/apps/packages/gcc/7.3.0/openmpi/3.1.2/lib
-L/apps/packages/gcc/7.3.0/openmpi/3.1.2/lib
-Wl,-rpath,/apps/packages/compilers/gcc/7.3.0/lib/gcc/x86_64-pc-linux-gnu/7.3.0
-L/apps/packages/compilers/gcc/7.3.0/lib/gcc/x86_64-pc-linux-gnu/7.3.0
-Wl,-rpath,/apps/packages/compilers/gcc/7.3.0/lib64
-L/apps/packages/compilers/gcc/7.3.0/lib64
-Wl,-rpath,/apps/packages/compilers/gcc/7.3.0/lib
-L/apps/packages/compilers/gcc/7.3.0/lib -lHYPRE -llapack -lblas -lcufft
-lcublas -lcudart -lcusparse -lcusolver -lcurand -lX11 -lstdc++ -ldl
-lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi -lgfortran -lm -lutil -lrt -lz
-lgfortran -lm -lgfortran -lgcc_s -lquadmath -lpthread -lquadmath -lstdc++ -ldl
-----------------------------------------
##########################################################
# #
# WARNING!!! #
# #
# This code was compiled with a debugging option. #
# To get timing results run ./configure #
# using --with-debugging=no, the performance will #
# be generally two or three times faster. #
# #
##########################################################