I reproduced this HIPSPARSE_STATUS_INVALID_VALUE error, but have not yet found any obviously invalid input arguments in this hipSPARSE call.
On Fri, Jan 19, 2024 at 2:18 PM Barry Smith <bsm...@petsc.dev> wrote:
>
> Junchao
>
> I run the following on the CI machine, why does this happen? With
> trivial solver options it runs ok.
>
> bsmith@petsc-gpu-02:/scratch/bsmith/petsc/src/ksp/ksp/tutorials$ ./ex34
> -da_grid_x 192 -da_grid_y 192 -da_grid_z 192 -dm_mat_type seqaijhipsparse
> -dm_vec_type seqhip -ksp_max_it 10 -ksp_monitor -ksp_type richardson
> -ksp_view -log_view -mg_coarse_ksp_max_it 2 -mg_coarse_ksp_type richardson
> -mg_coarse_pc_type none -mg_levels_ksp_type richardson -mg_levels_pc_type
> none -options_left -pc_mg_levels 3 -pc_mg_log -pc_type mg
>
> [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
> [0]PETSC ERROR: GPU error
> [0]PETSC ERROR: hipSPARSE errorcode 3 (HIPSPARSE_STATUS_INVALID_VALUE)
> [0]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc!
> [0]PETSC ERROR: Option left: name:-options_left (no value) source: command line
> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
> [0]PETSC ERROR: Petsc Release Version 3.20.3, unknown
> [0]PETSC ERROR: ./ex34 on a named petsc-gpu-02 by bsmith Fri Jan 19 14:15:20 2024
> [0]PETSC ERROR: Configure options --package-prefix-hash=/home/bsmith/petsc-hash-pkgs --with-make-np=24 --with-make-test-np=8 --with-hipc=/opt/rocm-5.4.3/bin/hipcc --with-hip-dir=/opt/rocm-5.4.3 COPTFLAGS="-g -O" FOPTFLAGS="-g -O" CXXOPTFLAGS="-g -O" HIPOPTFLAGS="-g -O" --with-cuda=0 --with-hip=1 --with-precision=double --with-clanguage=c --download-kokkos --download-kokkos-kernels --download-hypre --download-magma --with-magma-fortran-bindings=0 --download-mfem --download-metis --with-strict-petscerrorcode PETSC_ARCH=arch-ci-linux-hip-double
> [0]PETSC ERROR: #1 MatMultAddKernel_SeqAIJHIPSPARSE() at /scratch/bsmith/petsc/src/mat/impls/aij/seq/seqhipsparse/aijhipsparse.hip.cpp:3131
> [0]PETSC ERROR: #2 MatMultAdd_SeqAIJHIPSPARSE() at /scratch/bsmith/petsc/src/mat/impls/aij/seq/seqhipsparse/aijhipsparse.hip.cpp:3004
> [0]PETSC ERROR: #3 MatMultAdd() at /scratch/bsmith/petsc/src/mat/interface/matrix.c:2770
> [0]PETSC ERROR: #4 MatInterpolateAdd() at /scratch/bsmith/petsc/src/mat/interface/matrix.c:8603
> [0]PETSC ERROR: #5 PCMGMCycle_Private() at /scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:87
> [0]PETSC ERROR: #6 PCMGMCycle_Private() at /scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:83
> [0]PETSC ERROR: #7 PCApply_MG_Internal() at /scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:611
> [0]PETSC ERROR: #8 PCApply_MG() at /scratch/bsmith/petsc/src/ksp/pc/impls/mg/mg.c:633
> [0]PETSC ERROR: #9 PCApply() at /scratch/bsmith/petsc/src/ksp/pc/interface/precon.c:498
> [0]PETSC ERROR: #10 KSP_PCApply() at /scratch/bsmith/petsc/include/petsc/private/kspimpl.h:383
> [0]PETSC ERROR: #11 KSPSolve_Richardson() at /scratch/bsmith/petsc/src/ksp/ksp/impls/rich/rich.c:106
> [0]PETSC ERROR: #12 KSPSolve_Private() at /scratch/bsmith/petsc/src/ksp/ksp/interface/itfunc.c:906
> [0]PETSC ERROR: #13 KSPSolve() at /scratch/bsmith/petsc/src/ksp/ksp/interface/itfunc.c:1079
> [0]PETSC ERROR: #14 main() at ex34.c:52
> [0]PETSC ERROR: PETSc Option Table entries:
>
> Dave,
>
> Trying to debug the 7% now, but having trouble running, as you see
> above.
>
>
> On Jan 19, 2024, at 3:02 PM, Dave May <dave.mayhe...@gmail.com> wrote:
>
> Thank you Barry and Junchao for these explanations. I'll turn on
> -log_view_gpu_time.
>
> Do either of you have any thoughts regarding why the percentage of flops
> being reported on the GPU is not 100% for MGSmooth Level {0,1,2} for this
> solver configuration?
>
> This number should have nothing to do with timings as it reports the ratio
> of operations performed on the GPU and CPU, presumably obtained from
> PetscLogFlops() and PetscLogGpuFlops().
>
> Cheers,
> Dave
>
> On Fri, 19 Jan 2024 at 11:39, Junchao Zhang <junchao.zh...@gmail.com>
> wrote:
>
>> Try to also add -log_view_gpu_time,
>> https://petsc.org/release/manualpages/Profiling/PetscLogGpuTime/
>>
>> --Junchao Zhang
>>
>>
>> On Fri, Jan 19, 2024 at 11:35 AM Dave May <dave.mayhe...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I am trying to understand the logging information associated with the
>>> %flops-performed-on-the-gpu reported by -log_view when running
>>> src/ksp/ksp/tutorials/ex34
>>> with the following options
>>> -da_grid_x 192
>>> -da_grid_y 192
>>> -da_grid_z 192
>>> -dm_mat_type seqaijhipsparse
>>> -dm_vec_type seqhip
>>> -ksp_max_it 10
>>> -ksp_monitor
>>> -ksp_type richardson
>>> -ksp_view
>>> -log_view
>>> -mg_coarse_ksp_max_it 2
>>> -mg_coarse_ksp_type richardson
>>> -mg_coarse_pc_type none
>>> -mg_levels_ksp_type richardson
>>> -mg_levels_pc_type none
>>> -options_left
>>> -pc_mg_levels 3
>>> -pc_mg_log
>>> -pc_type mg
>>>
>>> This config is not intended to actually solve the problem, rather it is
>>> a stripped down set of options designed to understand what parts of the
>>> smoothers are being executed on the GPU.
>>>
>>> With respect to the log file attached, my first set of questions relates
>>> to the data reported under "Event Stage 2: MG Apply".
>>>
>>> [1] Why is the log littered with nan's?
>>> * I don't understand how and why "GPU Mflop/s" should be reported as nan
>>> when a value is given for "GPU %F" (see MatMult for example).
>>>
>>> * For events executed on the GPU, I assume the column "Time (sec)"
>>> relates to "CPU execute time"; this would explain why we see a nan in
>>> "Time (sec)" for MatMult.
>>> If my assumption is correct, how should I interpret the column "Flop
>>> (Max)", which is showing 1.92e+09?
>>> I would assume that if "Time (sec)" relates to the CPU then "Flop (Max)"
>>> should also relate to the CPU, and GPU flops would be logged in
>>> "GPU Mflop/s".
>>>
>>> [2] More curious is that within "Event Stage 2: MG Apply", KSPSolve,
>>> MGSmooth Level 0, MGSmooth Level 1 and MGSmooth Level 2 all report
>>> "GPU %F" as 93. I believe this value should be 100, as the smoother (and
>>> coarse grid solver) are configured as richardson(2)+none and thus should
>>> run entirely on the GPU.
>>> Furthermore, when one inspects all events listed under "Event Stage 2:
>>> MG Apply", those events which do flops correctly report "GPU %F" as 100.
>>> And the events showing "GPU %F" = 0, such as
>>> MatHIPSPARSCopyTo, VecCopy, VecSet, PCApply, DCtxSync
>>> don't do any flops (on the CPU or GPU), which is also correct
>>> (although non-GPU events should show nan??)
>>>
>>> Hence I am wondering what is the explanation for the missing 7% from
>>> "GPU %F" for KSPSolve and MGSmooth {0,1,2}??
>>>
>>> Does anyone understand this -log_view, or can explain to me how to
>>> interpret it?
>>>
>>> It could simply be that:
>>> a) something is messed up with -pc_mg_log
>>> b) something is messed up with the PETSc build
>>> c) I am putting too much faith in -log_view and should profile the code
>>> differently.
>>>
>>> Either way I'd really like to understand what is going on.
>>>
>>>
>>> Cheers,
>>> Dave
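
A note on the "GPU %F" question raised in the thread: Dave describes the column as the ratio of operations logged on the GPU to all logged operations, presumably accumulated via PetscLogFlops() and PetscLogGpuFlops(). Taking that reading as an assumption (it is not checked here against the -log_view source), a value of 93 would mean that roughly 7% of the flops attributed to the event were logged as CPU flops. The sketch below is plain Python, not PETSc code; the function names are hypothetical, and the 1.92e+09 figure is borrowed from the MatMult "Flop (Max)" entry purely as an illustrative magnitude.

```python
def gpu_flop_percent(cpu_flops, gpu_flops):
    """Percentage of all logged flops that were logged as GPU flops.

    Assumed model of "GPU %F": 100 * gpu / (cpu + gpu).
    """
    total = cpu_flops + gpu_flops
    if total == 0:
        # An event that logs no flops at all has no meaningful ratio;
        # this sketch reports 0 for it.
        return 0.0
    return 100.0 * gpu_flops / total


def implied_cpu_flops(gpu_flops, gpu_percent):
    """CPU flops that must have been logged for the ratio to equal gpu_percent."""
    if not 0.0 < gpu_percent <= 100.0:
        raise ValueError("gpu_percent must be in (0, 100]")
    return gpu_flops * (100.0 - gpu_percent) / gpu_percent


if __name__ == "__main__":
    gpu = 1.92e9  # illustrative magnitude only, not a measured GPU flop count
    cpu = implied_cpu_flops(gpu, 93.0)
    print(f"CPU flops implied by GPU %F = 93: {cpu:.3e}")
    print(f"round trip: {gpu_flop_percent(cpu, gpu):.1f}%")
```

Under this model, an enclosing event such as KSPSolve or MGSmooth can only drop below 100 if some nested operation logs flops on the CPU side, so one way to narrow down the missing 7% is to look for nested events whose flops are not counted as GPU flops.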