Read my comments ALL the way down; they go a long way.
> On Feb 6, 2020, at 3:43 AM, Hao DONG <[email protected]> wrote:
>
> Dear Hong and Barry,
>
> Thanks for the suggestions. So there could be some problems in my PETSc
> configuration? But my PETSc lib was indeed compiled without the debug flags
> (--with-debugging=0). I use GCC/GFortran (Homebrew GCC 9.2.0) to compile and
> build both PETSc and my own Fortran code. My Fortran compile flags are simply:
>
> -O3 -ffree-line-length-none -fastsse
>
> which are also used for FOPTFLAGS in PETSc (I added -openmp for PETSc, but
> not for my Fortran code, as I don't have any OpenMP optimizations in my
> program). Note that the performance test results I listed yesterday (e.g. 4.08s
> with 41 bcgs iterations) do not include any CSR-array->PETSc translation
> overhead (they only include the set and solve parts).
PETSc doesn't use -openmp in any way for its solvers. Do not use this
option; it may be slowing the code down. Please send configure.log.
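For reference, a configure line along these lines should be enough for an optimized GCC build (this is only a sketch based on the options you already use, with the OpenMP/PGI-style flags dropped; adjust packages and paths to your setup):

./configure --with-debugging=0 --with-scalar-type=complex --download-mumps --download-scalapack --with-fortran-kernels=1 --COPTFLAGS="-O3" --CXXOPTFLAGS="-O3" --FOPTFLAGS="-O3"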
>
> I have two questions about the performance difference -
>
> 1. Is the ILU factorization computed only once per solve, or is it performed at
> every outer KSP iteration step? The latter sounds unlikely - but if so, it could
> cause some extra overhead.
ILU is ONLY done when the matrix has changed (so that hypothesis seems wrong).
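In your -log_view output below, MatLUFactorNum has a count of 3, i.e. one ILU factorization per block for the entire solve, which confirms it is not redone every iteration. If you later solve repeatedly with an unchanged matrix (say inside your optimization loop) and want to be certain the factorization is kept, a minimal sketch with the standard Fortran bindings (ksp here stands for your outer KSP object) is:

      ! reuse the existing preconditioner (the per-block ILU factors)
      ! across KSPSolve calls until you explicitly allow a refactorization
      call KSPSetReusePreconditioner(ksp, PETSC_TRUE, ierr)
      call KSPSolve(ksp, b, x, ierr)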
>
> 2. Some KSP solvers like BCGS or TFQMR have two “half-iterations” per
> iteration step. I am not sure how this works in PETSc, but is it possible that
> both “half” relative residuals are counted in the output array, doubling the
> number of iterations (though that cannot explain the extra time consumed)?
Yes, PETSc might report them as two; you need to check the exact code.
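One quick way to check is to compare the iteration count KSP itself reports against the number of residual entries your code collects. A minimal sketch (standard Fortran bindings, assuming the usual petscksp includes; ksp, b, x are your objects):

      PetscInt       its
      PetscErrorCode ierr

      call KSPSolve(ksp, b, x, ierr)
      ! iterations as counted by the KSP implementation itself
      call KSPGetIterationNumber(ksp, its, ierr)
      print *, 'KSP iteration count reported by PETSc:', its
      ! compare this with the number of lines printed by -ksp_monitor
      ! (or with the number of entries in your residual-history array)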
>
> Anyway, the output with -log_view from the same 278906 by 278906 matrix with
> 3-block D-ILU in PETSc is as follows:
>
>
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> MEMsolv.lu on a arch-darwin-c-opt named Haos-MBP with 1 processor, by
> donghao Thu Feb 6 09:07:35 2020
> Using Petsc Release Version 3.12.3, unknown
>
> Max Max/Min Avg Total
> Time (sec): 4.443e+00 1.000 4.443e+00
> Objects: 1.155e+03 1.000 1.155e+03
> Flop: 4.311e+09 1.000 4.311e+09 4.311e+09
> Flop/sec: 9.703e+08 1.000 9.703e+08 9.703e+08
> MPI Messages: 0.000e+00 0.000 0.000e+00 0.000e+00
> MPI Message Lengths: 0.000e+00 0.000 0.000e+00 0.000e+00
> MPI Reductions: 0.000e+00 0.000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N -->
> 2N flop
> and VecAXPY() for complex vectors of length N
> --> 8N flop
>
> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages ---
> -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total Count %Total
> Avg %Total Count %Total
> 0: Main Stage: 4.4435e+00 100.0% 4.3113e+09 100.0% 0.000e+00 0.0%
> 0.000e+00 0.0% 0.000e+00 0.0%
>
> ————————————————————————————————————————————————————————————
> See the 'Profiling' chapter of the users' manual for details on interpreting
> output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flop: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> AvgLen: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
> %T - percent time in this phase %F - percent flop in this phase
> %M - percent messages in this phase %L - percent message lengths
> in this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
> all processors)
>
> ------------------------------------------------------------------------------------------------------------------------
> Event Count Time (sec) Flop
> --- Global --- --- Stage ---- Total
> Max Ratio Max Ratio Max Ratio Mess AvgLen
> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSidedF 1 1.0 2.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatMult 83 1.0 1.7815e+00 1.0 2.08e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 40 48 0 0 0 40 48 0 0 0 1168
> MatSolve 252 1.0 1.2708e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 29 28 0 0 0 29 28 0 0 0 939
> MatLUFactorNum 3 1.0 7.9725e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 2 1 0 0 0 2 1 0 0 0 298
> MatILUFactorSym 3 1.0 2.6998e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> MatAssemblyBegin 5 1.0 3.6000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyEnd 5 1.0 3.1619e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> MatGetRowIJ 3 1.0 2.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatCreateSubMats 1 1.0 3.9659e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> MatGetOrdering 3 1.0 4.3070e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatView 3 1.0 1.3600e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecDot 82 1.0 1.8948e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 4 4 0 0 0 4 4 0 0 0 966
> VecDotNorm2 41 1.0 1.6812e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 4 4 0 0 0 4 4 0 0 0 1088
> VecNorm 43 1.0 9.5099e-02 1.0 9.59e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 2 2 0 0 0 2 2 0 0 0 1009
> VecCopy 2 1.0 1.0120e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet 271 1.0 3.8922e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> VecAXPY 1 1.0 7.7200e-04 1.0 2.23e+06 1.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 2890
> VecAXPBYCZ 82 1.0 2.4370e-01 1.0 3.66e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 5 8 0 0 0 5 8 0 0 0 1502
> VecWAXPY 82 1.0 1.4148e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 3 4 0 0 0 3 4 0 0 0 1293
> VecAssemblyBegin 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAssemblyEnd 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecScatterBegin 84 1.0 5.9300e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSetUp 4 1.0 1.4167e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve 1 1.0 4.0250e+00 1.0 4.31e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 91100 0 0 0 91100 0 0 0 1071
> PCSetUp 4 1.0 1.5207e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 3 1 0 0 0 3 1 0 0 0 156
> PCSetUpOnBlocks 1 1.0 1.1116e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 3 1 0 0 0 3 1 0 0 0 214
> PCApply 84 1.0 1.2912e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 29 28 0 0 0 29 28 0 0 0 924
> PCApplyOnBlocks 252 1.0 1.2909e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 29 28 0 0 0 29 28 0 0 0 924
>
> ------------------------------------------------------------------------------------------------------------------------
>
> # I skipped the memory part - the options (and compiler options) are as
> follows:
>
> #PETSc Option Table entries:
> -ksp_type bcgs
> -ksp_view
> -log_view
> -pc_bjacobi_local_blocks 3
> -pc_factor_levels 0
> -pc_sub_type ilu
> -pc_type bjacobi
> #End of PETSc Option Table entries
> Compiled with FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 16 sizeof(PetscInt) 4
> Configure options: --with-scalar-type=complex --download-mumps
> --download-scalapack --with-fortran-kernels=1 --FOPTFLAGS="-O3
> -fastsse -mp -openmp" --COPTFLAGS="-O3 -fastsse -mp -openmp"
> --CXXOPTFLAGS="-O3 -fastsse -mp -openmp" --with-debugging=0
> -----------------------------------------
> Libraries compiled on 2020-02-03 10:44:31 on Haos-MBP
> Machine characteristics: Darwin-19.2.0-x86_64-i386-64bit
> Using PETSc directory: /Users/donghao/src/git/PETSc-current
> Using PETSc arch: arch-darwin-c-opt
> -----------------------------------------
>
> Using C compiler: mpicc -Wall -Wwrite-strings -Wno-strict-aliasing
> -Wno-unknown-pragmas -fstack-protector -fno-stack-check -Qunused-arguments
> -fvisibility=hidden
> Using Fortran compiler: mpif90 -Wall -ffree-line-length-0
> -Wno-unused-dummy-argument
>
> Using include paths: -I/Users/donghao/src/git/PETSc-current/include
> -I/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/include
> -----------------------------------------
>
> Using C linker: mpicc
> Using Fortran linker: mpif90
> Using libraries:
> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib
> -L/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib -lpetsc
> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib
> -L/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib
> -Wl,-rpath,/usr/local/opt/libevent/lib -L/usr/local/opt/libevent/lib
> -Wl,-rpath,/usr/local/Cellar/open-mpi/4.0.2/lib
> -L/usr/local/Cellar/open-mpi/4.0.2/lib
> -Wl,-rpath,/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-darwin19/9.2.0
> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-darwin19/9.2.0
> -Wl,-rpath,/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9
> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 -lcmumps -ldmumps -lsmumps
> -lzmumps -lmumps_common -lpord -lscalapack -llapack -lblas -lc++ -ldl
> -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi
> -lgfortran -lquadmath -lm -lc++ -ldl
>
>
> On the other hand, running PETSc with
>
> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -pc_sub_type lu -ksp_type gmres
> -ksp_monitor -ksp_view -log_view
>
> for the same problem takes 5.37s and 72 GMRES iterations. Our previous
> tests show that BiCGstab (bcgs in PETSc) is almost always the most
> effective KSP solver for our non-symmetric complex system. Strangely, the
> system is still using ilu instead of lu for the sub-blocks. The output is like:
The option is -sub_pc_type lu, not -pc_sub_type lu; the mis-prefixed option is ignored, so the blocks keep the default ILU.
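So for the runs above, the full option set would look something like (note the sub_ prefix; the factor-level option presumably needs the same prefix to reach the blocks):

-ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu -sub_pc_factor_levels 0

and -sub_pc_type lu for the LU-on-each-block variant.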
>
> 0 KSP Residual norm 2.480412407430e+02
> 1 KSP Residual norm 8.848059967835e+01
> 2 KSP Residual norm 3.415272863261e+01
> 3 KSP Residual norm 1.563045190939e+01
> 4 KSP Residual norm 6.241296940043e+00
> 5 KSP Residual norm 2.739710899854e+00
> 6 KSP Residual norm 1.391304148888e+00
> 7 KSP Residual norm 7.959262020849e-01
> 8 KSP Residual norm 4.828323055231e-01
> 9 KSP Residual norm 2.918529739200e-01
> 10 KSP Residual norm 1.905508589557e-01
> 11 KSP Residual norm 1.291541892702e-01
> 12 KSP Residual norm 8.827145774707e-02
> 13 KSP Residual norm 6.521331095889e-02
> 14 KSP Residual norm 5.095787952595e-02
> 15 KSP Residual norm 4.043060387395e-02
> 16 KSP Residual norm 3.232590200012e-02
> 17 KSP Residual norm 2.593944982216e-02
> 18 KSP Residual norm 2.064639483533e-02
> 19 KSP Residual norm 1.653916663492e-02
> 20 KSP Residual norm 1.334946415452e-02
> 21 KSP Residual norm 1.092886880597e-02
> 22 KSP Residual norm 8.988004105542e-03
> 23 KSP Residual norm 7.466501315240e-03
> 24 KSP Residual norm 6.284389135436e-03
> 25 KSP Residual norm 5.425231669964e-03
> 26 KSP Residual norm 4.766338253084e-03
> 27 KSP Residual norm 4.241238878242e-03
> 28 KSP Residual norm 3.808113525685e-03
> 29 KSP Residual norm 3.449383788116e-03
> 30 KSP Residual norm 3.126025526388e-03
> 31 KSP Residual norm 2.958328054299e-03
> 32 KSP Residual norm 2.802344900403e-03
> 33 KSP Residual norm 2.621993580492e-03
> 34 KSP Residual norm 2.430066269304e-03
> 35 KSP Residual norm 2.259043079597e-03
> 36 KSP Residual norm 2.104287972986e-03
> 37 KSP Residual norm 1.952916080045e-03
> 38 KSP Residual norm 1.804988937999e-03
> 39 KSP Residual norm 1.643302117377e-03
> 40 KSP Residual norm 1.471661332036e-03
> 41 KSP Residual norm 1.286445911163e-03
> 42 KSP Residual norm 1.127543025848e-03
> 43 KSP Residual norm 9.777148275484e-04
> 44 KSP Residual norm 8.293314450006e-04
> 45 KSP Residual norm 6.989331136622e-04
> 46 KSP Residual norm 5.852307780220e-04
> 47 KSP Residual norm 4.926715539762e-04
> 48 KSP Residual norm 4.215941372075e-04
> 49 KSP Residual norm 3.699489548162e-04
> 50 KSP Residual norm 3.293897163533e-04
> 51 KSP Residual norm 2.959954542998e-04
> 52 KSP Residual norm 2.700193032414e-04
> 53 KSP Residual norm 2.461789791204e-04
> 54 KSP Residual norm 2.218839085563e-04
> 55 KSP Residual norm 1.945154309976e-04
> 56 KSP Residual norm 1.661128781744e-04
> 57 KSP Residual norm 1.413198766258e-04
> 58 KSP Residual norm 1.213984003195e-04
> 59 KSP Residual norm 1.044317450754e-04
> 60 KSP Residual norm 8.919957502977e-05
> 61 KSP Residual norm 8.042584301275e-05
> 62 KSP Residual norm 7.292784493581e-05
> 63 KSP Residual norm 6.481935501872e-05
> 64 KSP Residual norm 5.718564652679e-05
> 65 KSP Residual norm 5.072589750116e-05
> 66 KSP Residual norm 4.487930741285e-05
> 67 KSP Residual norm 3.941040674119e-05
> 68 KSP Residual norm 3.492873281291e-05
> 69 KSP Residual norm 3.103798339845e-05
> 70 KSP Residual norm 2.822943237409e-05
> 71 KSP Residual norm 2.610615023776e-05
> 72 KSP Residual norm 2.441692671173e-05
> KSP Object: 1 MPI processes
> type: gmres
> restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization
> with no iterative refinement
> happy breakdown tolerance 1e-30
> maximum iterations=150, nonzero initial guess
> tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
> left preconditioning
> using PRECONDITIONED norm type for convergence test
> PC Object: 1 MPI processes
> type: bjacobi
> number of blocks = 3
> Local solve is same for all blocks, in the following KSP and PC objects:
> KSP Object: (sub_) 1 MPI processes
> type: preonly
> maximum iterations=10000, initial guess is zero
> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> left preconditioning
> using NONE norm type for convergence test
> PC Object: (sub_) 1 MPI processes
> type: ilu
> out-of-place factorization
> 0 levels of fill
> tolerance for zero pivot 2.22045e-14
> matrix ordering: natural
> factor fill ratio given 1., needed 1.
> Factored matrix follows:
> Mat Object: 1 MPI processes
> type: seqaij
> rows=92969, cols=92969
> package used to perform factorization: petsc
> total: nonzeros=638417, allocated nonzeros=638417
> total number of mallocs used during MatSetValues calls=0
> not using I-node routines
> linear system matrix = precond matrix:
> Mat Object: 1 MPI processes
> type: seqaij
> rows=92969, cols=92969
> total: nonzeros=638417, allocated nonzeros=638417
> total number of mallocs used during MatSetValues calls=0
> not using I-node routines
> linear system matrix = precond matrix:
> Mat Object: 1 MPI processes
> type: mpiaij
> rows=278906, cols=278906
> total: nonzeros=3274027, allocated nonzeros=3274027
> total number of mallocs used during MatSetValues calls=0
> not using I-node (on process 0) routines
> ...
>
> ------------------------------------------------------------------------------------------------------------------------
> Event Count Time (sec) Flop
> --- Global --- --- Stage ---- Total
> Max Ratio Max Ratio Max Ratio Mess AvgLen
> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSidedF 1 1.0 2.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatMult 75 1.0 1.5812e+00 1.0 1.88e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 28 24 0 0 0 28 24 0 0 0 1189
> MatSolve 228 1.0 1.1442e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 20 14 0 0 0 20 14 0 0 0 944
These flop rates are ok, but not great. Perhaps an older machine.
> MatLUFactorNum 3 1.0 8.1930e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 290
> MatILUFactorSym 3 1.0 2.7102e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyBegin 5 1.0 3.7000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyEnd 5 1.0 3.1895e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> MatGetRowIJ 3 1.0 2.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatCreateSubMats 1 1.0 4.0904e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> MatGetOrdering 3 1.0 4.2640e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatView 3 1.0 1.4400e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecMDot 72 1.0 1.1984e+00 1.0 2.25e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 21 28 0 0 0 21 28 0 0 0 1877
21 percent of the time in VecMDot is huge for a sequential run. I
think maybe you are using a terrible OpenMP BLAS?
Send configure.log
> VecNorm 76 1.0 1.6841e-01 1.0 1.70e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 3 2 0 0 0 3 2 0 0 0 1007
> VecScale 75 1.0 1.8241e-02 1.0 8.37e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 0 1 0 0 0 0 1 0 0 0 4587
> VecCopy 3 1.0 1.4970e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet 276 1.0 9.1970e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
> VecAXPY 6 1.0 3.7450e-03 1.0 1.34e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 3575
> VecMAXPY 75 1.0 1.0022e+00 1.0 2.41e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 18 30 0 0 0 18 30 0 0 0 2405
> VecAssemblyBegin 2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAssemblyEnd 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecScatterBegin 76 1.0 5.5100e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecNormalize 75 1.0 1.8462e-01 1.0 2.51e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 3 3 0 0 0 3 3 0 0 0 1360
> KSPSetUp 4 1.0 1.1341e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve 1 1.0 5.3123e+00 1.0 7.91e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 93100 0 0 0 93100 0 0 0 1489
> KSPGMRESOrthog 72 1.0 2.1316e+00 1.0 4.50e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 37 57 0 0 0 37 57 0 0 0 2110
> PCSetUp 4 1.0 1.5531e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 3 0 0 0 0 3 0 0 0 0 153
> PCSetUpOnBlocks 1 1.0 1.1343e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 2 0 0 0 0 2 0 0 0 0 209
> PCApply 76 1.0 1.1671e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 20 14 0 0 0 20 14 0 0 0 925
> PCApplyOnBlocks 228 1.0 1.1668e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 20 14 0 0 0 20 14 0 0 0 925
> ————————————————————————————————————————————————————————————
> ...
> #PETSc Option Table entries:
> -ksp_monitor
> -ksp_type gmres
> -ksp_view
> -log_view
> -pc_bjacobi_local_blocks 3
> -pc_sub_type lu
> -pc_type bjacobi
> #End of PETSc Option Table entries
> ...
>
> Does any of the setup/output ring a bell?
>
> BTW, out of curiosity - what is an “I-node” routine?
>
>
> Cheers,
> Hao
>
>
> From: Smith, Barry F. <[email protected]>
> Sent: Wednesday, February 5, 2020 9:42 PM
> To: Hao DONG <[email protected]>
> Cc: [email protected] <[email protected]>
> Subject: Re: [petsc-users] What is the right way to implement a (block)
> Diagonal ILU as PC?
>
>
>
> > On Feb 5, 2020, at 4:36 AM, Hao DONG <[email protected]> wrote:
> >
> > Thanks a lot for your suggestions, Hong and Barry -
> >
> > As you suggested, I tried the LU direct solvers (built-in and MUMPS)
> > first this morning; they work perfectly, albeit slowly. As my problem itself
> > is part of a PDE-based optimization, an exact solution in the search
> > procedure is not necessary (I often set a relative tolerance of 1E-7/1E-8,
> > etc. for Krylov subspace methods). I also tried BJACOBI with exact LU; the
> > KSP converges in one or two iterations, as expected.
> >
> > I added the -ksp_view option for the D-ILU test (still with block Jacobi as
> > the preconditioner and bcgs as the KSP solver). The -ksp_view output from one
> > of the examples in a toy model looks like:
> >
> > KSP Object: 1 MPI processes
> > type: bcgs
> > maximum iterations=120, nonzero initial guess
> > tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
> > left preconditioning
> > using PRECONDITIONED norm type for convergence test
> > PC Object: 1 MPI processes
> > type: bjacobi
> > number of blocks = 3
> > Local solve is same for all blocks, in the following KSP and PC
> > objects:
> > KSP Object: (sub_) 1 MPI processes
> > type: preonly
> > maximum iterations=10000, initial guess is zero
> > tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
> > left preconditioning
> > using NONE norm type for convergence test
> > PC Object: (sub_) 1 MPI processes
> > type: ilu
> > out-of-place factorization
> > 0 levels of fill
> > tolerance for zero pivot 2.22045e-14
> > matrix ordering: natural
> > factor fill ratio given 1., needed 1.
> > Factored matrix follows:
> > Mat Object: 1 MPI processes
> > type: seqaij
> > rows=11294, cols=11294
> > package used to perform factorization: petsc
> > total: nonzeros=76008, allocated nonzeros=76008
> > total number of mallocs used during MatSetValues calls=0
> > not using I-node routines
> > linear system matrix = precond matrix:
> > Mat Object: 1 MPI processes
> > type: seqaij
> > rows=11294, cols=11294
> > total: nonzeros=76008, allocated nonzeros=76008
> > total number of mallocs used during MatSetValues calls=0
> > not using I-node routines
> > linear system matrix = precond matrix:
> > Mat Object: 1 MPI processes
> > type: mpiaij
> > rows=33880, cols=33880
> > total: nonzeros=436968, allocated nonzeros=436968
> > total number of mallocs used during MatSetValues calls=0
> > not using I-node (on process 0) routines
> >
> > do you see something wrong with my setup?
> >
> > I also tried a quick performance test with a small 278906 by 278906 matrix
> > (3850990 nnz) with the following parameters:
> >
> > -ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 -pc_sub_type ilu
> > -ksp_view
> >
> > Reducing the relative residual to 1E-7
> >
> > Took 4.08s with 41 bcgs iterations.
> >
> > Merely changing the -pc_bjacobi_local_blocks to 6
> >
> > Took 7.02s with 73 bcgs iterations. 9 blocks would further take 9.45s with
> > 101 bcgs iterations.
>
> This is normal: more blocks means slower convergence, since each block discards more of the coupling and the preconditioner becomes weaker.
> >
> > As a reference, my home-brew Fortran code solves the same problem (3-block
> > D-ILU0) in
> >
> > 1.84s with 24 bcgs iterations (the bcgs code is also a home-brew one)…
> >
> Run the PETSc code with optimization (./configure --with-debugging=0) and
> run the code with -log_view; this will show where the PETSc code is spending
> the time (send it to us).
>
>
> >
> >
> > Well, by saying “use explicit L/U matrix as preconditioner”, I wonder if a
> > user is allowed to provide his own (separate) L and U Mat for
> > preconditioning (like how it works in Matlab solvers), e.g.
> >
> > x = qmr(A,b,Tol,MaxIter,L,U,x)
> >
> > As I already explicitly constructed the L and U matrices in Fortran, it is
> > not hard to convert them to Mat objects in PETSc to test PETSc and my Fortran
> > code head-to-head. In that case, A, b, x, and L/U would all be identical, so it
> > would be easier to see where the difference comes from.
> >
> >
> No, we don't provide this kind of support
>
>
> >
> > BTW, there is another thing I wondered about - is there a way to output the
> > residual in the unpreconditioned norm? I tried to
> >
> > call KSPSetNormType(ksp_local, KSP_NORM_UNPRECONDITIONED, ierr)
> >
> > But I always get an error that the current KSP does not support the
> > unpreconditioned norm with LEFT/RIGHT preconditioning (either way). Is it
> > possible to output the unpreconditioned residual in PETSc at all?
>
> Use -ksp_monitor_true_residual. You can also run GMRES (and some other
> methods) with right preconditioning, -ksp_pc_side right; then the residual
> computed by the algorithm is the unpreconditioned residual.
>
> KSPSetNormType sets the norm used in the algorithm; it generally has to be
> left or right, and only a couple of algorithms support the unpreconditioned
> norm directly.
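>
> In Fortran that would look roughly like this (a sketch, assuming the usual
> petscksp includes and that ksp_local is your outer KSP):
>
>   ! right preconditioning: the residual the algorithm computes is then
>   ! the true (unpreconditioned) residual
>   call KSPSetType(ksp_local, KSPGMRES, ierr)
>   call KSPSetPCSide(ksp_local, PC_RIGHT, ierr)
>   ! or simply add -ksp_monitor_true_residual on the command line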
>
> Barry
>
>
> >
> > Cheers,
> > Hao
> >
> >
> > From: Smith, Barry F. <[email protected]>
> > Sent: Tuesday, February 4, 2020 8:27 PM
> > To: Hao DONG <[email protected]>
> > Cc: [email protected] <[email protected]>
> > Subject: Re: [petsc-users] What is the right way to implement a (block)
> > Diagonal ILU as PC?
> >
> >
> >
> > > On Feb 4, 2020, at 12:41 PM, Hao DONG <[email protected]> wrote:
> > >
> > > Dear all,
> > >
> > >
> > > I have a few questions about the implementation of a diagonal ILU PC in
> > > PETSc. I want to solve a very simple system with KSP (in parallel); the
> > > nature of the system (finite-difference time-harmonic Maxwell) is
> > > probably not important to the question itself. Long story short, I just
> > > need to solve a set of equations Ax = b with a block-diagonal system
> > > matrix, like (not sure if the mono font works):
> > >
> > >     | X     |
> > > A = |   Y   |
> > >     |     Z |
> > >
> > > Note that A is not really block-diagonal, it’s just a multi-diagonal
> > > matrix determined by the FD mesh, where most elements are close to
> > > diagonal. So instead of a full ILU decomposition, a D-ILU is a good
> > > approximation as a preconditioner, and the number of blocks should not
> > > matter too much:
> > >
> > >     | Lx      |           | Ux      |
> > > L = |    Ly   |   and U = |    Uy   |
> > >     |      Lz |           |      Uz |
> > >
> > > Where [Lx, Ux] = ILU0(X), etc. Indeed, the D-ILU preconditioner (with 3N
> > > blocks) is quite efficient with Krylov subspace methods like BiCGstab or
> > > QMR in my sequential Fortran/Matlab code.
> > >
> > > So like most users, I am looking for a parallel implementation of this
> > > preconditioner in PETSc. After looking through the manual, it seems that the
> > > most straightforward way to do it is through PCBJACOBI. If I understand it
> > > right, I just set up a 3-block PCBJACOBI and give each of the blocks a KSP
> > > with PCILU. Is this supposed to be equivalent to my D-ILU preconditioner? My
> > > little block of Fortran code looks like:
> > > ...
> > > call PCBJacobiSetTotalBlocks(pc_local,Ntotal, &
> > > & isubs,ierr)
> > > call PCBJacobiSetLocalBlocks(pc_local, Nsub, &
> > > & isubs(istart:iend),ierr)
> > > ! set up the block jacobi structure
> > > call KSPSetup(ksp_local,ierr)
> > > ! allocate sub ksps
> > > allocate(ksp_sub(Nsub))
> > > call PCBJacobiGetSubKSP(pc_local,Nsub,istart, &
> > > & ksp_sub,ierr)
> > > do i=1,Nsub
> > > call KSPGetPC(ksp_sub(i),pc_sub,ierr)
> > > !ILU preconditioner
> > > call PCSetType(pc_sub,ptype,ierr)
> > > call PCFactorSetLevels(pc_sub,1,ierr) ! use ILU(1) here
> > > call KSPSetType(ksp_sub(i),KSPPREONLY,ierr)
> > > end do
> > > call KSPSetTolerances(ksp_local,KSSiter%tol,PETSC_DEFAULT_REAL, &
> > > & PETSC_DEFAULT_REAL,KSSiter%maxit,ierr)
> > > …
> >
> > This code looks essentially right. You should run with -ksp_view to
> > see exactly what PETSc is using for a solver.
> >
> > >
> > > I understand that the parallel performance may not be comparable, so I
> > > first set up a one-process test (with MPIAij, but all the rows are local
> > > since there is only one process). The system is solved without any
> > > problem (identical results within error). But the performance is actually
> > > a lot worse (code built without debugging flags in performance tests)
> > > than my own home-brew implementation in Fortran (I wrote my own ILU0 in
> > > CSR sparse matrix format), which is hard to believe. I suspect the
> > > difference comes from the PC, as the PETSc version takes many more BiCGstab
> > > iterations (60-ish vs 100-ish) to converge to the same relative tolerance.
> >
> > PETSc uses GMRES by default with a restart of 30 and left
> > preconditioning. Different exact choices in the solver (which is why
> > -ksp_view is so useful) could explain the differences in the runs
> > between your code and PETSc's.
> > >
> > > This is further confirmed when I change the setup of D-ILU (using 6 or 9
> > > blocks instead of 3). While my Fortran/Matlab codes see minimal
> > > performance difference (<5%) when I play with the D-ILU setup, increasing
> > > the number of D-ILU blocks from 3 to 9 caused the ksp setup with
> > > PCBJACOBI to suffer a performance decrease of more than 50% in sequential
> > > test.
> >
> > This is odd; the more blocks, the smaller each block, so the quicker the
> > ILU setup should be. You can run the various cases with -log_view and send us
> > the output to see where the time is going in each part of the computation.
> >
> > > So my implementation IS somewhat different in PETSc. Am I missing something
> > > in the PCBJACOBI setup? Or do I have some fundamental misunderstanding of
> > > how PCBJACOBI works in PETSc?
> >
> > Probably not.
> > >
> > > If this is not the right way to implement a block diagonal ILU as
> > > (parallel) PC, please kindly point me to the right direction. I searched
> > > through the mail list to find some answers, only to find a couple of
> > > similar questions... An example would be nice.
> >
> > Your approach seems fundamentally right, but I cannot rule out possible
> > bugs.
> > >
> > > On the other hand, does PETSc support a simple way to use explicit L/U
> > > matrices as a preconditioner? I can import the D-ILU matrices (I have
> > > already converted my A matrix into a Mat) constructed in my Fortran code to
> > > make a better comparison. Or do I have to construct my own PC using PCSHELL?
> > > If so, is there a good tutorial/example on how to use PCSHELL (in a more
> > > advanced sense, like how to set up the PC side and transpose)?
> >
> > Not sure what you mean by an explicit L/U matrix as a preconditioner. As
> > Hong said, yes, you can use a parallel LU from MUMPS, SuperLU_DIST, or
> > PaStiX as the solver. You do not need any shell code. You simply need to
> > set the PCType to lu.
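> >
> > For example, a minimal sketch in Fortran (assuming pc is the PC obtained
> > from your KSP with KSPGetPC) would be:
> >
> >   ! use a direct parallel LU factorization through MUMPS
> >   call PCSetType(pc, PCLU, ierr)
> >   call PCFactorSetMatSolverType(pc, MATSOLVERMUMPS, ierr)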
> >
> > You can also set all these options from the command line and don't need
> > to write the code specifically. So call KSPSetFromOptions() and then, for
> > example,
> >
> > -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu (this
> > last one is applied to each block, so you could use -sub_pc_type lu and it
> > would use LU on each block.)
> >
> > -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps (do a
> > parallel LU with MUMPS)
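> >
> > The only code change needed for that is making sure the options are picked
> > up before the solve, e.g. (a sketch using your own variable names):
> >
> >   ! read -ksp_*, -pc_*, -sub_* options from the command line
> >   call KSPSetFromOptions(ksp_local, ierr)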
> >
> > By not hardwiring the choices in the code and just using options, you can
> > test out different cases much more quickly.
> >
> > Use -ksp_view to make sure it is using the solver the way you expect.
> >
> > Barry
> >
> >
> > >
> > > Thanks in advance,
> > >
> > > Hao