Read my comments ALL the way down; they go a long way.

> On Feb 6, 2020, at 3:43 AM, Hao DONG <[email protected]> wrote:
> 
> Dear Hong and Barry, 
> 
> Thanks for the suggestions. So there could be some problem in my PETSc 
> configuration? But my PETSc lib was indeed compiled without debug flags 
> (--with-debugging=0). I use GCC/gfortran (Homebrew GCC 9.2.0) to compile and 
> build both PETSc and my own Fortran code. My Fortran compile flags are simply: 
> 
> -O3 -ffree-line-length-none -fastsse 
> 
> These flags are also used for FOPTFLAGS in PETSc (I added -openmp for PETSc, 
> but not for my Fortran code, as I don’t have any OpenMP optimizations in my 
> program). Note that the performance test results I listed yesterday (e.g. 4.08 s 
> with 41 bcgs iterations) exclude any CSR-array-to-PETSc translation overhead 
> (they only include the set and solve parts). 

   PETSc does not use -openmp in any way for its solvers. Do not use this 
option; it may be slowing the code down. Please send configure.log.
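
   For reference, a configure sketch without the OpenMP flags (just an example, 
adjust to your setup; -fastsse, -mp and -openmp look like PGI/Intel-style 
options, while with GCC/gfortran the OpenMP flag would be -fopenmp, which, per 
the note above, you do not want for PETSc anyway):

     ./configure --with-scalar-type=complex --download-mumps --download-scalapack \
         --with-fortran-kernels=1 --with-debugging=0 \
         --COPTFLAGS="-O3 -march=native" --CXXOPTFLAGS="-O3 -march=native" \
         --FOPTFLAGS="-O3 -march=native"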

> 
> I have two questions about the performance difference - 
> 
> 1. Is the ILU factorization computed only once, or is it performed at every 
> outer KSP iteration step? The latter sounds unlikely - but if so, it could add 
> some extra overhead. 

   ILU is ONLY done when the matrix changes, never at every KSP iteration (doing 
it every iteration would indeed be wrong). You can see this in your -log_view 
output: MatILUFactorSym and MatLUFactorNum each have a count of 3, i.e. once per 
block, not once per iteration.
> 
> 2. Some KSP solvers like BCGS or TFQMR have two “half-iterations” for each 
> iteration step. I am not sure how this works in PETSc, but is it possible that 
> both “half” relative residuals are counted in the output array, doubling the 
> number of iterations (although that cannot explain the extra time consumed)?

   Yes, PETSc may report them as two; you would need to check the exact code to 
be sure.

> 
> Anyway, the output with -log_view from the same 278906 by 278906 matrix with 
> 3-block D-ILU in PETSc is as follows: 
> 
> 
> ---------------------------------------------- PETSc Performance Summary: 
> ----------------------------------------------
>  
>  MEMsolv.lu on a arch-darwin-c-opt named Haos-MBP with 1 processor, by 
> donghao Thu Feb  6 09:07:35 2020
>  Using Petsc Release Version 3.12.3, unknown
>  
>                           Max       Max/Min     Avg       Total
>  Time (sec):           4.443e+00     1.000   4.443e+00
>  Objects:              1.155e+03     1.000   1.155e+03
>  Flop:                 4.311e+09     1.000   4.311e+09  4.311e+09
>  Flop/sec:             9.703e+08     1.000   9.703e+08  9.703e+08
>  MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>  MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>  MPI Reductions:       0.000e+00     0.000
>  
>  Flop counting convention: 1 flop = 1 real number operation of type 
> (multiply/divide/add/subtract)
>                              e.g., VecAXPY() for real vectors of length N --> 
> 2N flop
>                              and VecAXPY() for complex vectors of length N 
> --> 8N flop
>  
>  Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  
> -- Message Lengths --  -- Reductions --
>                          Avg     %Total     Avg     %Total    Count   %Total  
>    Avg         %Total    Count   %Total
>   0:      Main Stage: 4.4435e+00 100.0%  4.3113e+09 100.0%  0.000e+00   0.0%  
> 0.000e+00        0.0%  0.000e+00   0.0%
>  
>  ------------------------------------------------------------------------------------------------------------------------
>  See the 'Profiling' chapter of the users' manual for details on interpreting 
> output.
>  Phase summary info:
>     Count: number of times phase was executed
>     Time and Flop: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>     Mess: number of messages sent
>     AvgLen: average message length (bytes)
>     Reduct: number of global reductions
>     Global: entire computation
>     Stage: stages of a computation. Set stages with PetscLogStagePush() and 
> PetscLogStagePop().
>        %T - percent time in this phase         %F - percent flop in this phase
>        %M - percent messages in this phase     %L - percent message lengths 
> in this phase
>        %R - percent reductions in this phase
>     Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over 
> all processors)
>  
> ------------------------------------------------------------------------------------------------------------------------
>  Event                Count      Time (sec)     Flop                          
>     --- Global ---  --- Stage ----  Total
>                     Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>  
> ------------------------------------------------------------------------------------------------------------------------
>  
>  --- Event Stage 0: Main Stage
>  
>  BuildTwoSidedF         1 1.0 2.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  MatMult               83 1.0 1.7815e+00 1.0 2.08e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00 40 48  0  0  0  40 48  0  0  0  1168
>  MatSolve             252 1.0 1.2708e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00 29 28  0  0  0  29 28  0  0  0   939
>  MatLUFactorNum         3 1.0 7.9725e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
> 0.0e+00  2  1  0  0  0   2  1  0  0  0   298
>  MatILUFactorSym        3 1.0 2.6998e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>  MatAssemblyBegin       5 1.0 3.6000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  MatAssemblyEnd         5 1.0 3.1619e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>  MatGetRowIJ            3 1.0 2.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  MatCreateSubMats       1 1.0 3.9659e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>  MatGetOrdering         3 1.0 4.3070e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  MatView                3 1.0 1.3600e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  VecDot                82 1.0 1.8948e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  4  4  0  0  0   4  4  0  0  0   966
>  VecDotNorm2           41 1.0 1.6812e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  4  4  0  0  0   4  4  0  0  0  1088
>  VecNorm               43 1.0 9.5099e-02 1.0 9.59e+07 1.0 0.0e+00 0.0e+00 
> 0.0e+00  2  2  0  0  0   2  2  0  0  0  1009
>  VecCopy                2 1.0 1.0120e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  VecSet               271 1.0 3.8922e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>  VecAXPY                1 1.0 7.7200e-04 1.0 2.23e+06 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0  2890
>  VecAXPBYCZ            82 1.0 2.4370e-01 1.0 3.66e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  5  8  0  0  0   5  8  0  0  0  1502
>  VecWAXPY              82 1.0 1.4148e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  3  4  0  0  0   3  4  0  0  0  1293
>  VecAssemblyBegin       2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  VecAssemblyEnd         2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  VecScatterBegin       84 1.0 5.9300e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  KSPSetUp               4 1.0 1.4167e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  KSPSolve               1 1.0 4.0250e+00 1.0 4.31e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00 91100  0  0  0  91100  0  0  0  1071
>  PCSetUp                4 1.0 1.5207e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
> 0.0e+00  3  1  0  0  0   3  1  0  0  0   156
>  PCSetUpOnBlocks        1 1.0 1.1116e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
> 0.0e+00  3  1  0  0  0   3  1  0  0  0   214
>  PCApply               84 1.0 1.2912e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00 29 28  0  0  0  29 28  0  0  0   924
>  PCApplyOnBlocks      252 1.0 1.2909e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00 29 28  0  0  0  29 28  0  0  0   924
>  
> ------------------------------------------------------------------------------------------------------------------------
> 
> # I skipped the memory part - the options (and compiler options) are as 
> follows: 
> 
> #PETSc Option Table entries: 
>  -ksp_type bcgs
>  -ksp_view
>  -log_view
>  -pc_bjacobi_local_blocks 3
>  -pc_factor_levels 0
>  -pc_sub_type ilu
>  -pc_type bjacobi
>  #End of PETSc Option Table entries
>  Compiled with FORTRAN kernels
>  Compiled with full precision matrices (default)
>  sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 
> sizeof(PetscScalar) 16 sizeof(PetscInt) 4
>  Configure options: --with-scalar-type=complex --download-mumps 
> --download-scalapack --with-fortran-kernels=1 --FOPTFLAGS="-O3 -fastsse -mp 
> -openmp" --COPTFLAGS="-O3 -fastsse -mp -openmp" --CXXOPTFLAGS="-O3 -fastsse 
> -mp -openmp" --with-debugging=0
>  -----------------------------------------
>  Libraries compiled on 2020-02-03 10:44:31 on Haos-MBP
>  Machine characteristics: Darwin-19.2.0-x86_64-i386-64bit
>  Using PETSc directory: /Users/donghao/src/git/PETSc-current
>  Using PETSc arch: arch-darwin-c-opt
>  -----------------------------------------
> 
>  Using C compiler: mpicc  -Wall -Wwrite-strings -Wno-strict-aliasing 
> -Wno-unknown-pragmas -fstack-protector -fno-stack-check -Qunused-arguments 
> -fvisibility=hidden
>  Using Fortran compiler: mpif90  -Wall -ffree-line-length-0 
> -Wno-unused-dummy-argument
> 
> Using include paths: -I/Users/donghao/src/git/PETSc-current/include 
> -I/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/include
>  -----------------------------------------
>  
>  Using C linker: mpicc
>  Using Fortran linker: mpif90
>  Using libraries: 
> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
> -L/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib -lpetsc 
> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
> -L/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
> -Wl,-rpath,/usr/local/opt/libevent/lib -L/usr/local/opt/libevent/lib 
> -Wl,-rpath,/usr/local/Cellar/open-mpi/4.0.2/lib 
> -L/usr/local/Cellar/open-mpi/4.0.2/lib 
> -Wl,-rpath,/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-darwin19/9.2.0 
> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-darwin19/9.2.0 
> -Wl,-rpath,/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 
> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 -lcmumps -ldmumps -lsmumps -lzmumps 
> -lmumps_common -lpord -lscalapack -llapack -lblas -lc++ -ldl -lmpi_usempif08 
> -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi -lgfortran -lquadmath -lm -lc++ -ldl
> 
> 
> On the other hand, running PETSc with 
> 
> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -pc_sub_type lu -ksp_type gmres 
> -ksp_monitor -ksp_view -log_view
> 
> for the same problem takes 5.37 s and 72 GMRES iterations. Our previous 
> tests show that BiCGstab (bcgs in PETSc) is almost always the most effective 
> KSP solver for our non-symmetric complex system. Strangely, the solver is 
> still using ILU instead of LU for the sub-blocks. The output looks like: 

   The option is -sub_pc_type lu. -pc_sub_type is not a PETSc option, so it was 
ignored and the blocks kept the default ILU. (Likewise, the fill level for the 
blocks is set with -sub_pc_factor_levels, not -pc_factor_levels.)
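
   For example, a corrected option set for the block-LU test would look 
something like (a sketch; keep your own monitors and tolerances):

     -ksp_type gmres -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type lu -ksp_monitor -ksp_view -log_view

   With -ksp_view you should then see "type: lu" under the (sub_) PC object 
instead of "type: ilu".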

> 
>    0 KSP Residual norm 2.480412407430e+02
>    1 KSP Residual norm 8.848059967835e+01
>    2 KSP Residual norm 3.415272863261e+01
>    3 KSP Residual norm 1.563045190939e+01
>    4 KSP Residual norm 6.241296940043e+00
>    5 KSP Residual norm 2.739710899854e+00
>    6 KSP Residual norm 1.391304148888e+00
>    7 KSP Residual norm 7.959262020849e-01
>    8 KSP Residual norm 4.828323055231e-01
>    9 KSP Residual norm 2.918529739200e-01
>   10 KSP Residual norm 1.905508589557e-01
>   11 KSP Residual norm 1.291541892702e-01
>   12 KSP Residual norm 8.827145774707e-02
>   13 KSP Residual norm 6.521331095889e-02
>   14 KSP Residual norm 5.095787952595e-02
>   15 KSP Residual norm 4.043060387395e-02
>   16 KSP Residual norm 3.232590200012e-02
>   17 KSP Residual norm 2.593944982216e-02
>   18 KSP Residual norm 2.064639483533e-02
>   19 KSP Residual norm 1.653916663492e-02
>   20 KSP Residual norm 1.334946415452e-02
>   21 KSP Residual norm 1.092886880597e-02
>   22 KSP Residual norm 8.988004105542e-03
>   23 KSP Residual norm 7.466501315240e-03
>   24 KSP Residual norm 6.284389135436e-03
>   25 KSP Residual norm 5.425231669964e-03
>   26 KSP Residual norm 4.766338253084e-03
>   27 KSP Residual norm 4.241238878242e-03
>   28 KSP Residual norm 3.808113525685e-03
>   29 KSP Residual norm 3.449383788116e-03
>   30 KSP Residual norm 3.126025526388e-03
>   31 KSP Residual norm 2.958328054299e-03
>   32 KSP Residual norm 2.802344900403e-03
>   33 KSP Residual norm 2.621993580492e-03
>   34 KSP Residual norm 2.430066269304e-03
>   35 KSP Residual norm 2.259043079597e-03
>   36 KSP Residual norm 2.104287972986e-03
>   37 KSP Residual norm 1.952916080045e-03
>   38 KSP Residual norm 1.804988937999e-03
>   39 KSP Residual norm 1.643302117377e-03
>   40 KSP Residual norm 1.471661332036e-03
>   41 KSP Residual norm 1.286445911163e-03
>   42 KSP Residual norm 1.127543025848e-03
>   43 KSP Residual norm 9.777148275484e-04
>   44 KSP Residual norm 8.293314450006e-04
>   45 KSP Residual norm 6.989331136622e-04
>   46 KSP Residual norm 5.852307780220e-04
>   47 KSP Residual norm 4.926715539762e-04
>   48 KSP Residual norm 4.215941372075e-04
>   49 KSP Residual norm 3.699489548162e-04
>   50 KSP Residual norm 3.293897163533e-04
>   51 KSP Residual norm 2.959954542998e-04
>   52 KSP Residual norm 2.700193032414e-04
>   53 KSP Residual norm 2.461789791204e-04
>   54 KSP Residual norm 2.218839085563e-04
>   55 KSP Residual norm 1.945154309976e-04
>   56 KSP Residual norm 1.661128781744e-04
>   57 KSP Residual norm 1.413198766258e-04
>   58 KSP Residual norm 1.213984003195e-04
>   59 KSP Residual norm 1.044317450754e-04
>   60 KSP Residual norm 8.919957502977e-05
>   61 KSP Residual norm 8.042584301275e-05
>   62 KSP Residual norm 7.292784493581e-05
>   63 KSP Residual norm 6.481935501872e-05
>   64 KSP Residual norm 5.718564652679e-05
>   65 KSP Residual norm 5.072589750116e-05
>   66 KSP Residual norm 4.487930741285e-05
>   67 KSP Residual norm 3.941040674119e-05
>   68 KSP Residual norm 3.492873281291e-05
>   69 KSP Residual norm 3.103798339845e-05
>   70 KSP Residual norm 2.822943237409e-05
>   71 KSP Residual norm 2.610615023776e-05
>   72 KSP Residual norm 2.441692671173e-05
>  KSP Object: 1 MPI processes
>    type: gmres
>      restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization 
> with no iterative refinement
>      happy breakdown tolerance 1e-30
>    maximum iterations=150, nonzero initial guess
>    tolerances:  relative=1e-07, absolute=1e-50, divergence=10000.
>    left preconditioning
>    using PRECONDITIONED norm type for convergence test
>  PC Object: 1 MPI processes
>    type: bjacobi
>      number of blocks = 3
>      Local solve is same for all blocks, in the following KSP and PC objects:
>      KSP Object: (sub_) 1 MPI processes
>        type: preonly
>        maximum iterations=10000, initial guess is zero
>        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
>        left preconditioning
>        using NONE norm type for convergence test
>      PC Object: (sub_) 1 MPI processes
>        type: ilu
>          out-of-place factorization
>          0 levels of fill
>          tolerance for zero pivot 2.22045e-14
>          matrix ordering: natural
>          factor fill ratio given 1., needed 1.
>            Factored matrix follows:
>              Mat Object: 1 MPI processes
>                type: seqaij
>                rows=92969, cols=92969
>                package used to perform factorization: petsc
>                total: nonzeros=638417, allocated nonzeros=638417
>                total number of mallocs used during MatSetValues calls=0
>                  not using I-node routines
>        linear system matrix = precond matrix:
>        Mat Object: 1 MPI processes
>          type: seqaij
>          rows=92969, cols=92969
>          total: nonzeros=638417, allocated nonzeros=638417
>          total number of mallocs used during MatSetValues calls=0
>            not using I-node routines
>    linear system matrix = precond matrix:
>    Mat Object: 1 MPI processes
>      type: mpiaij
>      rows=278906, cols=278906
>      total: nonzeros=3274027, allocated nonzeros=3274027
>      total number of mallocs used during MatSetValues calls=0
>        not using I-node (on process 0) routines
> ...
>  
> ------------------------------------------------------------------------------------------------------------------------
>  Event                Count      Time (sec)     Flop                          
>     --- Global ---  --- Stage ----  Total
>                     Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>  
> ------------------------------------------------------------------------------------------------------------------------
>  
>  --- Event Stage 0: Main Stage
>  
>  BuildTwoSidedF         1 1.0 2.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  MatMult               75 1.0 1.5812e+00 1.0 1.88e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00 28 24  0  0  0  28 24  0  0  0  1189
>  MatSolve             228 1.0 1.1442e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00 20 14  0  0  0  20 14  0  0  0   944

   These flop rates are ok, but not great. Perhaps an older machine.

>  MatLUFactorNum         3 1.0 8.1930e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
> 0.0e+00  1  0  0  0  0   1  0  0  0  0   290
>  MatILUFactorSym        3 1.0 2.7102e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  MatAssemblyBegin       5 1.0 3.7000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  MatAssemblyEnd         5 1.0 3.1895e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>  MatGetRowIJ            3 1.0 2.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  MatCreateSubMats       1 1.0 4.0904e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>  MatGetOrdering         3 1.0 4.2640e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  MatView                3 1.0 1.4400e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  VecMDot               72 1.0 1.1984e+00 1.0 2.25e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00 21 28  0  0  0  21 28  0  0  0  1877

    21 percent of the time in VecMDot is huge for a sequential run. I think you 
may be using a terrible OpenMP BLAS.

    Send configure.log.
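
    One quick way to test that (only a suggestion, not a diagnosis) is to check 
which BLAS/LAPACK libraries configure picked up - they are recorded in 
configure.log - and, if they are threaded, rebuild with PETSc's reference 
BLAS/LAPACK and compare the timings:

     ./configure <your other options> --download-fblaslapack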


>  VecNorm               76 1.0 1.6841e-01 1.0 1.70e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  3  2  0  0  0   3  2  0  0  0  1007
>  VecScale              75 1.0 1.8241e-02 1.0 8.37e+07 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  1  0  0  0   0  1  0  0  0  4587
>  VecCopy                3 1.0 1.4970e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  VecSet               276 1.0 9.1970e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>  VecAXPY                6 1.0 3.7450e-03 1.0 1.34e+07 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0  3575
>  VecMAXPY              75 1.0 1.0022e+00 1.0 2.41e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00 18 30  0  0  0  18 30  0  0  0  2405
>  VecAssemblyBegin       2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  VecAssemblyEnd         2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  VecScatterBegin       76 1.0 5.5100e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  VecNormalize          75 1.0 1.8462e-01 1.0 2.51e+08 1.0 0.0e+00 0.0e+00 
> 0.0e+00  3  3  0  0  0   3  3  0  0  0  1360
>  KSPSetUp               4 1.0 1.1341e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>  KSPSolve               1 1.0 5.3123e+00 1.0 7.91e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00 93100  0  0  0  93100  0  0  0  1489
>  KSPGMRESOrthog        72 1.0 2.1316e+00 1.0 4.50e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00 37 57  0  0  0  37 57  0  0  0  2110
>  PCSetUp                4 1.0 1.5531e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
> 0.0e+00  3  0  0  0  0   3  0  0  0  0   153
>  PCSetUpOnBlocks        1 1.0 1.1343e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
> 0.0e+00  2  0  0  0  0   2  0  0  0  0   209
>  PCApply               76 1.0 1.1671e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00 20 14  0  0  0  20 14  0  0  0   925
>  PCApplyOnBlocks      228 1.0 1.1668e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
> 0.0e+00 20 14  0  0  0  20 14  0  0  0   925
>  ------------------------------------------------------------------------------------------------------------------------
> ...
> #PETSc Option Table entries:
>  -ksp_monitor
>  -ksp_type gmres
>  -ksp_view
>  -log_view
>  -pc_bjacobi_local_blocks 3
>  -pc_sub_type lu
>  -pc_type bjacobi
>  #End of PETSc Option Table entries
> ...
> 
> Does any of the setup/output ring a bell? 
> 
> BTW, out of curiosity - what is an “I-node” routine?
> 
> 
> Cheers, 
> Hao
> 
> 
> From: Smith, Barry F. <[email protected]>
> Sent: Wednesday, February 5, 2020 9:42 PM
> To: Hao DONG <[email protected]>
> Cc: [email protected] <[email protected]>
> Subject: Re: [petsc-users] What is the right way to implement a (block) 
> Diagonal ILU as PC?
>  
> 
> 
> > On Feb 5, 2020, at 4:36 AM, Hao DONG <[email protected]> wrote:
> > 
> > Thanks a lot for your suggestions, Hong and Barry - 
> > 
> > As you suggested, I first tried the LU direct solvers (built-in and MUMPS) 
> > this morning, and they work perfectly, albeit slowly. As my problem is part 
> > of a PDE-based optimization, an exact solution is not necessary during the 
> > search (I often set a relative tolerance of 1E-7/1E-8 or so for the Krylov 
> > subspace methods). I also tried BJACOBI with exact LU; the KSP converges in 
> > one or two iterations, as expected. 
> > 
> > I added the -ksp_view option for the D-ILU test (still with block Jacobi as 
> > the preconditioner and bcgs as the KSP solver). The -ksp_view output from 
> > one of the examples in a toy model looks like: 
> > 
> > KSP Object: 1 MPI processes
> >    type: bcgs
> >    maximum iterations=120, nonzero initial guess
> >    tolerances:  relative=1e-07, absolute=1e-50, divergence=10000.
> >    left preconditioning
> >    using PRECONDITIONED norm type for convergence test
> >  PC Object: 1 MPI processes
> >    type: bjacobi
> >      number of blocks = 3
> >      Local solve is same for all blocks, in the following KSP and PC 
> > objects:
> >      KSP Object: (sub_) 1 MPI processes
> >        type: preonly
> >        maximum iterations=10000, initial guess is zero
> >        tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
> >        left preconditioning
> >        using NONE norm type for convergence test
> >      PC Object: (sub_) 1 MPI processes
> >        type: ilu
> >          out-of-place factorization
> >          0 levels of fill
> >          tolerance for zero pivot 2.22045e-14
> >          matrix ordering: natural
> >          factor fill ratio given 1., needed 1.
> >            Factored matrix follows:
> >              Mat Object: 1 MPI processes
> >                type: seqaij
> >                rows=11294, cols=11294
> >                package used to perform factorization: petsc
> >                total: nonzeros=76008, allocated nonzeros=76008
> >                total number of mallocs used during MatSetValues calls=0
> >                  not using I-node routines
> >        linear system matrix = precond matrix:
> >        Mat Object: 1 MPI processes
> >          type: seqaij
> >          rows=11294, cols=11294
> >          total: nonzeros=76008, allocated nonzeros=76008
> >          total number of mallocs used during MatSetValues calls=0
> >            not using I-node routines
> >    linear system matrix = precond matrix:
> >    Mat Object: 1 MPI processes
> >      type: mpiaij
> >      rows=33880, cols=33880
> >      total: nonzeros=436968, allocated nonzeros=436968
> >      total number of mallocs used during MatSetValues calls=0
> >        not using I-node (on process 0) routines
> > 
> > do you see something wrong with my setup?
> > 
> > I also tried a quick performance test with a small 278906 by 278906 matrix 
> > (3850990 nnz) with the following parameters: 
> > 
> > -ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 -pc_sub_type ilu 
> > -ksp_view
> > 
> > Reducing the relative residual to 1E-7 
> > 
> > Took 4.08s with 41 bcgs iterations. 
> > 
> > Merely changing the -pc_bjacobi_local_blocks to 6 
> > 
> > Took 7.02s with 73 bcgs iterations. 9 blocks would further take 9.45s with 
> > 101 bcgs iterations.
> 
>     This is normal: more blocks, slower convergence.
> > 
> > As a reference, my home-brew Fortran code solves the same problem (3-block 
> > D-ILU0) in 
> > 
> > 1.84s with 24 bcgs iterations (the bcgs code is also a home-brew one)…
> > 
>     Run the PETSc code built with optimization (./configure --with-debugging=0) 
> and run it with -log_view; this will show where the PETSc code is spending 
> the time (send it to us).
> 
> 
> > 
> > 
> > Well, by saying “use explicit L/U matrix as preconditioner”, I wonder if a 
> > user is allowed to provide his own (separate) L and U Mat for 
> > preconditioning (like how it works in Matlab solvers), e.g.
> > 
> > x = qmr(A,b,Tol,MaxIter,L,U,x)
> >  
> > As I already explicitly constructed the L and U matrices in Fortran, it is 
> > not hard to convert them to Mat in PETSc to test PETSc and my Fortran code 
> > head-to-head. In that case the A, b, x, and L/U are all identical, so it 
> > would be easier to see where the problem comes from. 
> > 
> > 
>      No, we don't provide this kind of support
>      
> 
> > 
> > BTW, there is another thing I wondered - is there a way to output residual 
> > in unpreconditioned norm? I tried to
> > 
> > call KSPSetNormType(ksp_local, KSP_NORM_UNPRECONDITIONED, ierr)
> > 
> > But I always get an error that the current KSP does not support the 
> > unpreconditioned norm with LEFT/RIGHT preconditioning (either way). Is it 
> > possible to output the unpreconditioned residual in PETSc at all?
> 
>    Use -ksp_monitor_true_residual. You can also run GMRES (and some other 
> methods) with right preconditioning, -ksp_pc_side right; then the residual 
> computed by the algorithm is the unpreconditioned residual.
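> 
>    For example (a sketch; combine with your existing options):
> 
>      -ksp_type gmres -ksp_pc_side right -ksp_monitor_true_residual
> 
>    -ksp_monitor_true_residual prints the true (unpreconditioned) residual norm 
> at every iteration regardless of which side the preconditioner is applied on.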
> 
>    KSPSetNormType sets the norm used inside the algorithm; it generally has to 
> be the left- or right-preconditioned norm, and only a couple of algorithms 
> support the unpreconditioned norm directly.
> 
>    Barry
> 
> 
> > 
> > Cheers, 
> > Hao
> > 
> > 
> > From: Smith, Barry F. <[email protected]>
> > Sent: Tuesday, February 4, 2020 8:27 PM
> > To: Hao DONG <[email protected]>
> > Cc: [email protected] <[email protected]>
> > Subject: Re: [petsc-users] What is the right way to implement a (block) 
> > Diagonal ILU as PC?
> >  
> > 
> > 
> > > On Feb 4, 2020, at 12:41 PM, Hao DONG <[email protected]> wrote:
> > > 
> > > Dear all, 
> > > 
> > > 
> > > I have a few questions about the implementation of diagonal ILU PC in 
> > > PETSc. I want to solve a very simple system with KSP (in parallel), the 
> > > nature of the system (finite difference time-harmonic Maxwell) is 
> > > probably not important to the question itself. Long story short, I just 
> > > need to solve a set of equations of Ax = b with a block diagonal system 
> > > matrix, like (not sure if the mono font works): 
> > > 
> > >    |X    |  
> > > A =|  Y  |  
> > >    |    Z| 
> > > 
> > > Note that A is not really block-diagonal, it’s just a multi-diagonal 
> > > matrix determined by the FD mesh, where most elements are close to 
> > > diagonal. So instead of a full ILU decomposition, a D-ILU is a good 
> > > approximation as a preconditioner, and the number of blocks should not 
> > > matter too much: 
> > > 
> > >     |Lx      |         |Ux      |
> > > L = |   Ly   | and U = |   Uy   |
> > >     |      Lz|         |      Uz|
> > > 
> > > Where [Lx, Ux] = ILU0(X), etc. Indeed, the D-ILU preconditioner (with 3N 
> > > blocks) is quite efficient with Krylov subspace methods like BiCGstab or 
> > > QMR in my sequential Fortran/Matlab code. 
> > > 
> > > So like most users, I am looking for a parallel implementation of this 
> > > problem in PETSc. After looking through the manual, it seems that the most 
> > > straightforward way to do it is through PCBJACOBI. I am not sure I 
> > > understand it right: I just set up a 3-block PCBJACOBI and give each of the 
> > > blocks a KSP with PCILU. Is this supposed to be equivalent to my D-ILU 
> > > preconditioner? My little block of Fortran code looks like: 
> > > ...
> > >       call PCBJacobiSetTotalBlocks(pc_local,Ntotal,                   &
> > >      &     isubs,ierr)
> > >       call PCBJacobiSetLocalBlocks(pc_local, Nsub,                    &
> > >      &    isubs(istart:iend),ierr)
> > >       ! set up the block jacobi structure
> > >       call KSPSetup(ksp_local,ierr)
> > >       ! allocate sub ksps
> > >       allocate(ksp_sub(Nsub))
> > >       call PCBJacobiGetSubKSP(pc_local,Nsub,istart,                   &
> > >      &     ksp_sub,ierr)
> > >       do i=1,Nsub
> > >           call KSPGetPC(ksp_sub(i),pc_sub,ierr)
> > >           !ILU preconditioner
> > >           call PCSetType(pc_sub,ptype,ierr)
> > >           call PCFactorSetLevels(pc_sub,1,ierr) ! use ILU(1) here
> > >           call KSPSetType(ksp_sub(i),KSPPREONLY,ierr)
> > >       end do
> > >       call KSPSetTolerances(ksp_local,KSSiter%tol,PETSC_DEFAULT_REAL, &
> > >      &     PETSC_DEFAULT_REAL,KSSiter%maxit,ierr)
> > > … 
> > 
> >      This code looks essentially right. You should run with -ksp_view to 
> > see exactly what solver PETSc is using. 
> > 
> > > 
> > > I understand that the parallel performance may not be comparable, so I 
> > > first set up a one-process test (with MPIAij, but all the rows are local 
> > > since there is only one process). The system is solved without any 
> > > problem (identical results within error). But the performance is actually 
> > > a lot worse (code built without debugging flags in performance tests) 
> > > than my own home-brew implementation in Fortran (I wrote my own ILU0 in 
> > > CSR sparse matrix format), which is hard to believe. I suspect the 
> > > difference is from the PC, as the PETSc version took many more BiCGstab 
> > > iterations (60-ish vs 100-ish) to converge to the same relative tolerance. 
> > 
> >    PETSc uses GMRES by default with a restart of 30 and left 
> > preconditioning. Different exact choices in the solver (which is why 
> > -ksp_view is so useful) could explain the differences between your code's 
> > runs and PETSc's.
> > > 
> > > This is further confirmed when I change the setup of D-ILU (using 6 or 9 
> > > blocks instead of 3). While my Fortran/Matlab codes see minimal 
> > > performance difference (<5%) when I play with the D-ILU setup, increasing 
> > > the number of D-ILU blocks from 3 to 9 caused the ksp setup with 
> > > PCBJACOBI to suffer a performance decrease of more than 50% in sequential 
> > > test.
> > 
> >    This is odd: the more blocks, the smaller each block, so the quicker the 
> > ILU setup should be. You can run the various cases with -log_view and send 
> > us the output to see what is happening in each part of the computation.
> >  
> > > So my implementation IS somewhat different in PETSc. Am I missing 
> > > something in the PCBJACOBI setup? Or do I have some fundamental 
> > > misunderstanding of how PCBJACOBI works in PETSc? 
> > 
> >    Probably not.
> > > 
> > > If this is not the right way to implement a block-diagonal ILU as a 
> > > (parallel) PC, please kindly point me in the right direction. I searched 
> > > through the mail list to find some answers, only to find a couple of 
> > > similar questions... An example would be nice.
> > 
> >    Your approach seems fundamentally right, but I cannot be sure there are 
> > no possible bugs.
> > > 
> > > On the other hand, does PETSc support a simple way to use explicit L/U 
> > > matrix as a preconditioner? I can import the  D-ILU matrix (I already 
> > > converted my A matrix into Mat) constructed in my Fortran code to make a 
> > > better comparison. Or do I have to construct my own PC using PCshell? If 
> > > so, is there a good tutorial/example to learn about how to use PCSHELL 
> > > (in a more advanced sense, like how to setup pc side and transpose)? 
> > 
> >    Not sure what you mean by explicit L/U matrix as a preconditioner. As 
> > Hong said, yes you can use a parallel LU from MUMPS or SuperLU_DIST or 
> > Pastix as the solver. You do not need any shell code. You simply need to 
> > set the PCType to lu 
> > 
> >    You can also set all these options from the command line and don't need 
> > to hardwire them in the code. So call KSPSetFromOptions() and then, for 
> > example,
> > 
> >     -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu (this last 
> > one is applied to each block, so you could use -sub_pc_type lu instead and 
> > it would use LU on each block) 
> > 
> >    -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps (do a 
> > parallel LU solve with MUMPS)
> > 
> > By not hardwiring options in the code and instead setting them at run time, 
> > you can test out different cases much more quickly.
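> > 
> > A minimal sketch of the one call needed in the Fortran code so that the 
> > options above are actually picked up at run time:
> > 
> >       ! read -ksp_type, -pc_type, -sub_pc_type, ... from the options database
> >       call KSPSetFromOptions(ksp_local,ierr)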
> > 
> > Use -ksp_view to make sure it is using the solver the way you expect.
> > 
> > Barry
> > 
> > 
> > 
> > 
> > > 
> > > Thanks in advance, 
> > > 
> > > Hao
