Junchao,

    I ran the following on the CI machine. Why does this happen? With trivial 
solver options it runs OK.

bsmith@petsc-gpu-02:/scratch/bsmith/petsc/src/ksp/ksp/tutorials$ ./ex34 
-da_grid_x 192 -da_grid_y 192 -da_grid_z 192 -dm_mat_type seqaijhipsparse 
-dm_vec_type seqhip -ksp_max_it 10 -ksp_monitor -ksp_type richardson -ksp_view 
-log_view -mg_coarse_ksp_max_it 2 -mg_coarse_ksp_type richardson 
-mg_coarse_pc_type none -mg_levels_ksp_type richardson -mg_levels_pc_type none 
-options_left -pc_mg_levels 3 -pc_mg_log -pc_type mg
[0]PETSC ERROR: --------------------- Error Message 
--------------------------------------------------------------
[0]PETSC ERROR: GPU error
[0]PETSC ERROR: hipSPARSE errorcode 3 (HIPSPARSE_STATUS_INVALID_VALUE)
[0]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program 
crashed before usage or a spelling mistake, etc!
[0]PETSC ERROR:   Option left: name:-options_left (no value) source: command 
line
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.20.3, unknown 
[0]PETSC ERROR: ./ex34 on a  named petsc-gpu-02 by bsmith Fri Jan 19 14:15:20 
2024
[0]PETSC ERROR: Configure options 
--package-prefix-hash=/home/bsmith/petsc-hash-pkgs --with-make-np=24 
--with-make-test-np=8 --with-hipc=/opt/rocm-5.4.3/bin/hipcc 
--with-hip-dir=/opt/rocm-5.4.3 COPTFLAGS="-g -O" FOPTFLAGS="-g -O" 
CXXOPTFLAGS="-g -O" HIPOPTFLAGS="-g -O" --with-cuda=0 --with-hip=1 
--with-precision=double --with-clanguage=c --download-kokkos 
--download-kokkos-kernels --download-hypre --download-magma 
--with-magma-fortran-bindings=0 --download-mfem --download-metis 
--with-strict-petscerrorcode PETSC_ARCH=arch-ci-linux-hip-double
[0]PETSC ERROR: #1 MatMultAddKernel_SeqAIJHIPSPARSE() at 
/scratch/bsmith/petsc/src/mat/impls/aij/seq/seqhipsparse/aijhipsparse.hip.cpp:3131
[0]PETSC ERROR: #2 MatMultAdd_SeqAIJHIPSPARSE() at 
/scratch/bsmith/petsc/src/mat/impls/aij/seq/seqhipsparse/aijhipsparse.hip.cpp:3004
[0]PETSC ERROR: #3 MatMultAdd() at 
/scratch/bsmith/petsc/src/mat/interface/matrix.c:2770
[0]PETSC ERROR: #4 MatInterpolateAdd() at 
/scratch/bsmith/petsc/src/mat/interface/matrix.c:8603
[0]PETSC ERROR: #5 PCMGMCycle_Private() at 
/scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:87
[0]PETSC ERROR: #6 PCMGMCycle_Private() at 
/scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:83
[0]PETSC ERROR: #7 PCApply_MG_Internal() at 
/scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:611
[0]PETSC ERROR: #8 PCApply_MG() at 
/scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:633
[0]PETSC ERROR: #9 PCApply() at 
/scratch/bsmith/petsc/src/ksp/pc/interface/precon.c:498
[0]PETSC ERROR: #10 KSP_PCApply() at 
/scratch/bsmith/petsc/include/petsc/private/kspimpl.h:383
[0]PETSC ERROR: #11 KSPSolve_Richardson() at 
/scratch/bsmith/petsc/src/ksp/ksp/impls/rich/rich.c:106
[0]PETSC ERROR: #12 KSPSolve_Private() at 
/scratch/bsmith/petsc/src/ksp/ksp/interface/itfunc.c:906
[0]PETSC ERROR: #13 KSPSolve() at 
/scratch/bsmith/petsc/src/ksp/ksp/interface/itfunc.c:1079
[0]PETSC ERROR: #14 main() at ex34.c:52
[0]PETSC ERROR: PETSc Option Table entries:

  Dave,

    I am trying to debug the missing 7% now, but I am having trouble running 
the example, as you can see above.



> On Jan 19, 2024, at 3:02 PM, Dave May <[email protected]> wrote:
> 
> Thank you Barry and Junchao for these explanations. I'll turn on 
> -log_view_gpu_time.
> 
> Do either of you have any thoughts regarding why the percentage of flops 
> being reported on the GPU is not 100% for MGSmooth Level {0,1,2} for this 
> solver configuration?
> 
> This number should have nothing to do with timings as it reports the ratio of 
> operations performed on the GPU and CPU, presumably obtained from 
> PetscLogFlops() and PetscLogGpuFlops().
> 
> Cheers,
> Dave
> 
> On Fri, 19 Jan 2024 at 11:39, Junchao Zhang <[email protected]> wrote:
>> Try also adding -log_view_gpu_time; see
>> https://petsc.org/release/manualpages/Profiling/PetscLogGpuTime/
>> 
>> --Junchao Zhang
>> 
>> 
>> On Fri, Jan 19, 2024 at 11:35 AM Dave May <[email protected]> wrote:
>>> Hi all,
>>> 
>>> I am trying to understand the logging information associated with the 
>>> %flops-performed-on-the-gpu reported by -log_view when running 
>>>   src/ksp/ksp/tutorials/ex34
>>> with the following options
>>> -da_grid_x 192
>>> -da_grid_y 192
>>> -da_grid_z 192
>>> -dm_mat_type seqaijhipsparse
>>> -dm_vec_type seqhip
>>> -ksp_max_it 10
>>> -ksp_monitor
>>> -ksp_type richardson
>>> -ksp_view
>>> -log_view
>>> -mg_coarse_ksp_max_it 2
>>> -mg_coarse_ksp_type richardson
>>> -mg_coarse_pc_type none
>>> -mg_levels_ksp_type richardson
>>> -mg_levels_pc_type none
>>> -options_left
>>> -pc_mg_levels 3
>>> -pc_mg_log
>>> -pc_type mg
>>> 
>>> This config is not intended to actually solve the problem, rather it is a 
>>> stripped down set of options designed to understand what parts of the 
>>> smoothers are being executed on the GPU.
>>> 
>>> With respect to the log file attached, my first set of questions relates to 
>>> the data reported under "Event Stage 2: MG Apply".
>>> 
>>> [1] Why is the log littered with nan's?
>>> * I don't understand how and why "GPU Mflop/s" should be reported as nan 
>>> when a value is given for "GPU %F" (see MatMult for example).
>>> 
>>> * For events executed on the GPU, I assume the column "Time (sec)" relates 
>>> to "CPU execute time", this would explain why we see a nan in "Time (sec)" 
>>> for MatMult.
>>> If my assumption is correct, how should I interpret the column "Flop (Max)", 
>>> which is showing 1.92e+09? 
>>> I would assume that if "Time (sec)" relates to the CPU, then "Flop (Max)" 
>>> should also relate to the CPU, and GPU flops would be logged in "GPU Mflop/s".
>>> 
>>> [2] More curious is that within "Event Stage 2: MG Apply" KSPSolve, 
>>> MGSmooth Level 0, MGSmooth Level 1, MGSmooth Level 2 all report "GPU %F" as 
>>> 93. I believe this value should be 100 as the smoother (and coarse grid 
>>> solver) are configured as richardson(2)+none and thus should run entirely 
>>> on the GPU. 
>>> Furthermore, when one inspects all events listed under "Event Stage 2: MG 
>>> Apply" those events which do flops correctly report "GPU %F" as 100. 
>>> And the events showing "GPU %F" = 0, such as 
>>>   MatHIPSPARSCopyTo, VecCopy, VecSet, PCApply, DCtxSync
>>> don't do any flops (on the CPU or GPU), which is also correct (although 
>>> non-GPU events should show nan??)
>>> 
>>> Hence I am wondering what is the explanation for the missing 7% from "GPU 
>>> %F" for KSPSolve and MGSmooth {0,1,2}??
>>> 
>>> Does anyone understand this -log_view, or can explain to me how to 
>>> interpret it?
>>> 
>>> It could simply be that:
>>> a) something is messed up with -pc_mg_log
>>> b) something is messed up with the PETSc build
>>> c) I am putting too much faith in -log_view and should profile the code 
>>> differently.
>>> 
>>> Either way I'd really like to understand what is going on.
>>> 
>>> 
>>> Cheers,
>>> Dave
>>> 
>>> 
>>> 
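
For readers following the "GPU %F" question, here is a minimal sketch of the ratio Dave describes: the share of all logged flops that were logged as GPU flops. It is an illustration only. The function name and the flop counts are hypothetical, and whether -log_view computes the column exactly this way is an assumption, not something confirmed in this thread.

```python
def gpu_flop_percent(cpu_flops: float, gpu_flops: float) -> float:
    """Percentage of all logged flops that were logged as GPU flops.

    Hypothetical helper for illustration; not part of PETSc.
    """
    total = cpu_flops + gpu_flops
    if total == 0.0:
        # The thread notes that events doing no flops show "GPU %F" = 0.
        return 0.0
    return 100.0 * gpu_flops / total


# Made-up counts: an event whose flops are all on the GPU reports 100.
print(gpu_flop_percent(cpu_flops=0.0, gpu_flops=1.92e9))

# Made-up counts: a reading of 93 would mean about 7% of the flops
# attributed to that event were logged as CPU flops.
print(gpu_flop_percent(cpu_flops=7.0, gpu_flops=93.0))
```

Under this reading, an enclosing event such as KSPSolve or MGSmooth can only show less than 100 if some flops counted inside it were logged as CPU flops, which is the quantity the thread is trying to track down.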
