Hi Barry, 

Thank you for your suggestions (and insights)! Indeed, my initial motivation to 
try out PETSc was the variety of methods it offers. As my matrix pattern is 
relatively simple (the 3D time-harmonic Maxwell equations discretized with a 
staggered-grid finite difference scheme), and considering that I do not have 
the resources for the direct solvers, I was looking for a fast Krylov subspace 
method / preconditioner that scales well with, say, tens of CPU cores. 

As a simple block-Jacobi preconditioner seems to lose its efficiency with 
more than a handful of blocks, I plan to look into other 
methods/preconditioners, e.g. multigrid (as a preconditioner) or domain 
decomposition methods (see the P.S. below for what I plan to try first). But I 
will probably need to go through a fair amount of literature before laying my 
hands on those (or bothering you with more questions!). Anyway, thanks again 
for your kind help. 


All the best, 
Hao
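
P.S. For the record, I will probably start by experimenting purely from the 
command line before writing any new code - something like the options below 
(rough starting points only; I still need to check the details against the 
manual): 

-pc_type asm -pc_asm_overlap 1 -sub_pc_type ilu -ksp_type bcgs -ksp_monitor -log_view
-pc_type gamg -ksp_type gmres -ksp_monitor -log_view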

> On Feb 8, 2020, at 8:02 AM, Smith, Barry F. <[email protected]> wrote:
> 
> 
> 
>> On Feb 7, 2020, at 7:44 AM, Hao DONG <[email protected]> wrote:
>> 
>> Thanks, Barry, I really appreciate your help - 
>> 
>> I removed the OpenMP flags and rebuilt PETSc. So the problem is from the 
>> BLAS lib I linked?
> 
>  Yes, the OpenMP causes it to run in parallel, but the problem is not big 
> enough and the machine is not good enough for parallel BLAS to speed things 
> up; instead it slows things down a lot. We see this often: parallel BLAS must 
> be used with care.
> 
>> Not sure which version my BLAS is, though… But I also included the 
>> --download-scalapack option. Shouldn't that enable linking with PBLAS 
>> automatically?
>> 
>> After looking at the bcgs code in PETSc, I suppose the residual is indeed 
>> recorded twice per "actual iteration", which would explain the difference in 
>> iteration counts. 
>> 
>> My laptop is indeed an old machine (MBP15 mid-2014). I just cannot work with 
>> vi without a physical ESC key...
> 
>   The latest has a physical ESC; I am stuck without one for a couple more 
> years.
> 
>> I have attached the configure.log - I didn't know it was so large! 
>> 
>> Anyway, it seems that the removal of -openmp changes quite a lot of things, 
>> and the performance is indeed much better - the flop/sec increases by a 
>> factor of 3. Still, I am seeing 20 percent of the time in VecMDot, but no 
>> VecMDot in BCGS at all (see output below) - is that a feature of the GMRES 
>> method? 
> 
>   Yes, GMRES orthogonalizes against the last restart directions, which uses 
> these routines, while BCGS does not; this is why BCGS is cheaper per 
> iteration.
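> 
>   (Side note: the per-iteration orthogonalization cost of GMRES can be varied 
> from the command line if you want to experiment; these are standard options, 
> listed only as an illustration: 
> 
>   -ksp_gmres_restart 10              shorter restart, cheaper orthogonalization 
>   -ksp_gmres_modifiedgramschmidt     modified instead of classical Gram-Schmidt 
> 
>   A shorter restart usually costs extra iterations, so it is a trade-off.)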
> 
>    PETSc is no faster than your code because the algorithm is the same, the 
> compilers are the same, and the hardware is the same; there are no clever 
> tricks that would make PETSc much faster. What PETSc provides is a huge 
> variety of tested algorithms that no single person could code on their own. 
> Anything in PETSc you could code yourself, given endless time, and get 
> basically the same performance.
> 
>  Barry
> 
> 
>> 
>> here is the output of the same problem with:  
>> 
>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu -ksp_type gmres 
>> -ksp_monitor -ksp_view
>> 
>> 
>> ---------------------------------------------- PETSc Performance Summary: 
>> ----------------------------------------------
>> 
>> Mod3DMT.test on a arch-darwin-c-opt named Haos-MBP with 1 processor, by 
>> donghao Fri Feb  7 10:26:19 2020
>> Using Petsc Release Version 3.12.3, unknown
>> 
>>                          Max       Max/Min     Avg       Total
>> Time (sec):           2.520e+00     1.000   2.520e+00
>> Objects:              1.756e+03     1.000   1.756e+03
>> Flop:                 7.910e+09     1.000   7.910e+09  7.910e+09
>> Flop/sec:             3.138e+09     1.000   3.138e+09  3.138e+09
>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Reductions:       0.000e+00     0.000
>> 
>> Flop counting convention: 1 flop = 1 real number operation of type 
>> (multiply/divide/add/subtract)
>>                             e.g., VecAXPY() for real vectors of length N --> 
>> 2N flop
>>                             and VecAXPY() for complex vectors of length N 
>> --> 8N flop
>> 
>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  
>> -- Message Lengths --  -- Reductions --
>>                         Avg     %Total     Avg     %Total    Count   %Total  
>>    Avg         %Total    Count   %Total
>>  0:      Main Stage: 2.5204e+00 100.0%  7.9096e+09 100.0%  0.000e+00   0.0%  
>> 0.000e+00        0.0%  0.000e+00   0.0%
>> 
>> ------------------------------------------------------------------------------------------------------------------------
>> …
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                          
>>     --- Global ---  --- Stage ----  Total
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>> ------------------------------------------------------------------------------------------------------------------------
>> 
>> --- Event Stage 0: Main Stage
>> 
>> BuildTwoSidedF         1 1.0 3.4000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatMult               75 1.0 6.2884e-01 1.0 1.88e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 25 24  0  0  0  25 24  0  0  0  2991
>> MatSolve             228 1.0 4.4164e-01 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 18 14  0  0  0  18 14  0  0  0  2445
>> MatLUFactorNum         3 1.0 4.1317e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  2  0  0  0  0   2  0  0  0  0   574
>> MatILUFactorSym        3 1.0 2.3858e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatAssemblyBegin       5 1.0 4.4000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatAssemblyEnd         5 1.0 1.5067e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatGetRowIJ            3 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatCreateSubMats       1 1.0 2.4558e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatGetOrdering         3 1.0 1.3290e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatView                3 1.0 1.2800e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecMDot               72 1.0 4.9875e-01 1.0 2.25e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 20 28  0  0  0  20 28  0  0  0  4509
>> VecNorm               76 1.0 6.6666e-02 1.0 1.70e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  3  2  0  0  0   3  2  0  0  0  2544
>> VecScale              75 1.0 1.7982e-02 1.0 8.37e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  1  0  0  0   1  1  0  0  0  4653
>> VecCopy                3 1.0 1.5080e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecSet               276 1.0 9.6784e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
>> VecAXPY                6 1.0 3.6860e-03 1.0 1.34e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  3632
>> VecMAXPY              75 1.0 4.0490e-01 1.0 2.41e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 16 30  0  0  0  16 30  0  0  0  5951
>> VecAssemblyBegin       2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecAssemblyEnd         2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecScatterBegin       76 1.0 5.3800e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecNormalize          75 1.0 8.3690e-02 1.0 2.51e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  3  3  0  0  0   3  3  0  0  0  2999
>> KSPSetUp               4 1.0 1.1663e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> KSPSolve               1 1.0 2.2119e+00 1.0 7.91e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 88100  0  0  0  88100  0  0  0  3576
>> KSPGMRESOrthog        72 1.0 8.7843e-01 1.0 4.50e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 35 57  0  0  0  35 57  0  0  0  5121
>> PCSetUp                4 1.0 9.2448e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  4  0  0  0  0   4  0  0  0  0   257
>> PCSetUpOnBlocks        1 1.0 6.6597e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  3  0  0  0  0   3  0  0  0  0   356
>> PCApply               76 1.0 4.6281e-01 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 18 14  0  0  0  18 14  0  0  0  2333
>> PCApplyOnBlocks      228 1.0 4.6262e-01 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 18 14  0  0  0  18 14  0  0  0  2334
>> ------------------------------------------------------------------------------------------------------------------------
>> 
>> Average time to get PetscTime(): 1e-07
>> #PETSc Option Table entries:
>> -I LBFGS
>> -ksp_type gmres
>> -ksp_view
>> -log_view
>> -pc_bjacobi_local_blocks 3
>> -pc_type bjacobi
>> -sub_pc_type ilu
>> #End of PETSc Option Table entries
>> Compiled with FORTRAN kernels
>> Compiled with full precision matrices (default)
>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 
>> sizeof(PetscScalar) 16 sizeof(PetscInt) 4
>> Configure options: --with-scalar-type=complex --download-mumps 
>> --download-scalapack --with-fortran-kernels=1 --FOPTFLAGS="-O3 
>> -ffree-line-length-0 -msse2" --COPTFLAGS="-O3 -msse2" --CXXOPTFLAGS="-O3 
>> -msse2" --with-debugging=0
>> -----------------------------------------
>> Libraries compiled on 2020-02-07 10:07:42 on Haos-MBP
>> Machine characteristics: Darwin-19.3.0-x86_64-i386-64bit
>> Using PETSc directory: /Users/donghao/src/git/PETSc-current
>> Using PETSc arch: arch-darwin-c-opt
>> -----------------------------------------
>> 
>> Using C compiler: mpicc  -Wall -Wwrite-strings -Wno-strict-aliasing 
>> -Wno-unknown-pragmas -fstack-protector -fno-stack-check 
>> -Qunused-arguments -fvisibility=hidden -O3 -msse2
>> Using Fortran compiler: mpif90  -Wall -ffree-line-length-0 
>> -Wno-unused-dummy-argument -O3 -ffree-line-length-0 -msse2
>> -----------------------------------------
>> 
>> Using include paths: -I/Users/donghao/src/git/PETSc-current/include 
>> -I/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/include
>> -----------------------------------------
>> 
>> Using C linker: mpicc
>> Using Fortran linker: mpif90
>> Using libraries: 
>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>> -L/Users/donghao/src/git/PETSc-   current/arch-darwin-c-opt/lib -lpetsc 
>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>> -L/Users/    donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>> -Wl,-rpath,/usr/local/opt/libevent/lib -L/usr/local/opt/libevent/    lib 
>> -Wl,-rpath,/usr/local/Cellar/open-mpi/4.0.2/lib 
>> -L/usr/local/Cellar/open-mpi/4.0.2/lib -Wl,-rpath,/usr/local/Cellar/ 
>> gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-darwin19/9.2.0 
>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-        
>> darwin19/9.2.0 -Wl,-rpath,/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 
>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 -lcmumps -   ldmumps -lsmumps 
>> -lzmumps -lmumps_common -lpord -lscalapack -llapack -lblas -lc++ -ldl 
>> -lmpi_usempif08 -                 lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi 
>> -lgfortran -lquadmath -lm -lc++ -ldl
>> -----------------------------------------
>> 
>> 
>> 
>> The BCGS solver performance is now comparable to my own Fortran code 
>> (1.84s). Still, I feel that there is something wrong hidden somewhere in my 
>> setup - a professional library should perform better, I believe. Any other 
>> ideas that I can look into? Interestingly, there is no VecMDot operation at 
>> all! Here is the output with the options: 
>> 
>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu -ksp_type bcgs 
>> -ksp_monitor -ksp_view
>> 
>> 
>> 
>> 
>> ---------------------------------------------- PETSc Performance Summary: 
>> ----------------------------------------------
>> 
>> Mod3DMT.test on a arch-darwin-c-opt named Haos-MBP with 1 processor, by 
>> donghao Fri Feb  7 10:38:00 2020
>> Using Petsc Release Version 3.12.3, unknown
>> 
>>                          Max       Max/Min     Avg       Total
>> Time (sec):           2.187e+00     1.000   2.187e+00
>> Objects:              1.155e+03     1.000   1.155e+03
>> Flop:                 4.311e+09     1.000   4.311e+09  4.311e+09
>> Flop/sec:             1.971e+09     1.000   1.971e+09  1.971e+09
>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Reductions:       0.000e+00     0.000
>> 
>> Flop counting convention: 1 flop = 1 real number operation of type 
>> (multiply/divide/add/subtract)
>>                             e.g., VecAXPY() for real vectors of length N --> 
>> 2N flop
>>                             and VecAXPY() for complex vectors of length N 
>> --> 8N flop
>> 
>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  
>> -- Message Lengths --  -- Reductions --
>>                         Avg     %Total     Avg     %Total    Count   %Total  
>>    Avg         %Total    Count   %Total
>>  0:      Main Stage: 2.1870e+00 100.0%  4.3113e+09 100.0%  0.000e+00   0.0%  
>> 0.000e+00        0.0%  0.000e+00   0.0%
>> 
>> ------------------------------------------------------------------------------------------------------------------------
>> 
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                          
>>     --- Global ---  --- Stage ----  Total
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>> ------------------------------------------------------------------------------------------------------------------------
>> 
>> --- Event Stage 0: Main Stage
>> 
>> BuildTwoSidedF         1 1.0 2.2000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatMult               83 1.0 7.8726e-01 1.0 2.08e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 36 48  0  0  0  36 48  0  0  0  2644
>> MatSolve             252 1.0 5.5656e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 25 28  0  0  0  25 28  0  0  0  2144
>> MatLUFactorNum         3 1.0 4.5115e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  2  1  0  0  0   2  1  0  0  0   526
>> MatILUFactorSym        3 1.0 2.5103e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatAssemblyBegin       5 1.0 3.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatAssemblyEnd         5 1.0 1.5709e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatGetRowIJ            3 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatCreateSubMats       1 1.0 2.8989e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatGetOrdering         3 1.0 1.1200e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatView                3 1.0 1.2600e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecDot                82 1.0 8.9328e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  4  4  0  0  0   4  4  0  0  0  2048
>> VecDotNorm2           41 1.0 9.9019e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  5  4  0  0  0   5  4  0  0  0  1848
>> VecNorm               43 1.0 3.9988e-02 1.0 9.59e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  2  2  0  0  0   2  2  0  0  0  2399
>> VecCopy                2 1.0 1.1150e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecSet               271 1.0 4.2833e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>> VecAXPY                1 1.0 5.9200e-04 1.0 2.23e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  3769
>> VecAXPBYCZ            82 1.0 1.1448e-01 1.0 3.66e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  5  8  0  0  0   5  8  0  0  0  3196
>> VecWAXPY              82 1.0 6.7460e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  3  4  0  0  0   3  4  0  0  0  2712
>> VecAssemblyBegin       2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecAssemblyEnd         2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecScatterBegin       84 1.0 5.2800e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> KSPSetUp               4 1.0 1.4765e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> KSPSolve               1 1.0 1.8514e+00 1.0 4.31e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 85100  0  0  0  85100  0  0  0  2329
>> PCSetUp                4 1.0 1.0193e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  5  1  0  0  0   5  1  0  0  0   233
>> PCSetUpOnBlocks        1 1.0 7.1421e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  3  1  0  0  0   3  1  0  0  0   332
>> PCApply               84 1.0 5.7927e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 26 28  0  0  0  26 28  0  0  0  2060
>> PCApplyOnBlocks      252 1.0 5.7902e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 26 28  0  0  0  26 28  0  0  0  2061
>> ------------------------------------------------------------------------------------------------------------------------
>> 
>> 
>> Cheers, 
>> Hao
>> 
>> 
>> 
>> From: Smith, Barry F. <[email protected]>
>> Sent: Thursday, February 6, 2020 7:03 PM
>> To: Hao DONG <[email protected]>
>> Cc: [email protected] <[email protected]>
>> Subject: Re: [petsc-users] What is the right way to implement a (block) 
>> Diagonal ILU as PC?
>> 
>> 
>>  Read my comments ALL the way down, they go a long way.
>> 
>>> On Feb 6, 2020, at 3:43 AM, Hao DONG <[email protected]> wrote:
>>> 
>>> Dear Hong and Barry, 
>>> 
>>> Thanks for the suggestions. So there could be some problem in my PETSc 
>>> configuration? My PETSc lib was indeed compiled without the debug 
>>> flags (--with-debugging=0). I use GCC/GFortran (Homebrew GCC 9.2.0) to 
>>> compile and build both PETSc and my own Fortran code. My Fortran 
>>> compile flags are simply: 
>>> 
>>> -O3 -ffree-line-length-none -fastsse 
>>> 
>>> which are also used for -FOPTFLAGS in PETSc (I added -openmp for PETSc, but 
>>> not for my Fortran code, as I don't have any OpenMP optimizations in my 
>>> program). Note that the performance test results I listed yesterday (e.g. 
>>> 4.08s with 41 bcgs iterations) do not include any CSR-array-to-PETSc 
>>> translation overhead (they only cover the set and solve parts). 
>> 
>>   PETSc doesn't use -openmp in any way for its solvers. Do not use this 
>> option; it may be slowing the code down. Please send configure.log
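>> 
>>   (If the linked BLAS turns out to be a threaded vendor one, one way to rule 
>> it out completely is to let PETSc build its own reference BLAS/LAPACK - a 
>> sketch only, to be adapted to your existing configure line: 
>> 
>>   ./configure --with-scalar-type=complex --download-mumps --download-scalapack \
>>               --download-fblaslapack --with-debugging=0 
>> 
>>   The reference BLAS is not fast, but it is purely sequential and predictable.)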
>> 
>>> 
>>> I have two questions about the performance difference - 
>>> 
>>> 1. Is the ILU factorization only done once, or is it performed at 
>>> every outer KSP iteration step? The latter sounds unlikely - but if so, it 
>>> could cause some extra overhead. 
>> 
>>   ILU is ONLY done if the matrix has changed (which seems wrong).
>>> 
>>> 2. Some KSP solvers like BCGS or TFQMR have two "half-iterations" per 
>>> iteration step. Not sure how it works in PETSc, but is it possible that 
>>> both "half" relative residuals are counted in the output array, 
>>> doubling the number of iterations (though that cannot explain the extra time 
>>> consumed)?
>> 
>>   Yes, PETSc might report them as two; you need to check the exact code.
>> 
>>> 
>>> Anyway, the output with -log_view from the same 278906 by 278906 matrix 
>>> with 3-block D-ILU in PETSc is as follows: 
>>> 
>>> 
>>> ---------------------------------------------- PETSc Performance Summary: 
>>> ----------------------------------------------
>>> 
>>> MEMsolv.lu on a arch-darwin-c-opt named Haos-MBP with 1 processor, by 
>>> donghao Thu Feb  6 09:07:35 2020
>>> Using Petsc Release Version 3.12.3, unknown
>>> 
>>>                          Max       Max/Min     Avg       Total
>>> Time (sec):           4.443e+00     1.000   4.443e+00
>>> Objects:              1.155e+03     1.000   1.155e+03
>>> Flop:                 4.311e+09     1.000   4.311e+09  4.311e+09
>>> Flop/sec:             9.703e+08     1.000   9.703e+08  9.703e+08
>>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>>> MPI Reductions:       0.000e+00     0.000
>>> 
>>> Flop counting convention: 1 flop = 1 real number operation of type 
>>> (multiply/divide/add/subtract)
>>>                             e.g., VecAXPY() for real vectors of length N 
>>> --> 2N flop
>>>                             and VecAXPY() for complex vectors of length N 
>>> --> 8N flop
>>> 
>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages --- 
>>>  -- Message Lengths --  -- Reductions --
>>>                         Avg     %Total     Avg     %Total    Count   %Total 
>>>     Avg         %Total    Count   %Total
>>>  0:      Main Stage: 4.4435e+00 100.0%  4.3113e+09 100.0%  0.000e+00   0.0% 
>>>  0.000e+00        0.0%  0.000e+00   0.0%
>>> 
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on 
>>> interpreting output.
>>> Phase summary info:
>>>    Count: number of times phase was executed
>>>    Time and Flop: Max - maximum over all processors
>>>                   Ratio - ratio of maximum to minimum over all processors
>>>    Mess: number of messages sent
>>>    AvgLen: average message length (bytes)
>>>    Reduct: number of global reductions
>>>    Global: entire computation
>>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and 
>>> PetscLogStagePop().
>>>       %T - percent time in this phase         %F - percent flop in this 
>>> phase
>>>       %M - percent messages in this phase     %L - percent message lengths 
>>> in this phase
>>>       %R - percent reductions in this phase
>>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over 
>>> all processors)
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flop                         
>>>      --- Global ---  --- Stage ----  Total
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
>>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>> 
>>> --- Event Stage 0: Main Stage
>>> 
>>> BuildTwoSidedF         1 1.0 2.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatMult               83 1.0 1.7815e+00 1.0 2.08e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 40 48  0  0  0  40 48  0  0  0  1168
>>> MatSolve             252 1.0 1.2708e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 29 28  0  0  0  29 28  0  0  0   939
>>> MatLUFactorNum         3 1.0 7.9725e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  2  1  0  0  0   2  1  0  0  0   298
>>> MatILUFactorSym        3 1.0 2.6998e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatAssemblyBegin       5 1.0 3.6000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatAssemblyEnd         5 1.0 3.1619e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatGetRowIJ            3 1.0 2.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatCreateSubMats       1 1.0 3.9659e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatGetOrdering         3 1.0 4.3070e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatView                3 1.0 1.3600e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecDot                82 1.0 1.8948e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  4  4  0  0  0   4  4  0  0  0   966
>>> VecDotNorm2           41 1.0 1.6812e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  4  4  0  0  0   4  4  0  0  0  1088
>>> VecNorm               43 1.0 9.5099e-02 1.0 9.59e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  2  2  0  0  0   2  2  0  0  0  1009
>>> VecCopy                2 1.0 1.0120e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecSet               271 1.0 3.8922e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> VecAXPY                1 1.0 7.7200e-04 1.0 2.23e+06 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  2890
>>> VecAXPBYCZ            82 1.0 2.4370e-01 1.0 3.66e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  5  8  0  0  0   5  8  0  0  0  1502
>>> VecWAXPY              82 1.0 1.4148e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  4  0  0  0   3  4  0  0  0  1293
>>> VecAssemblyBegin       2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecAssemblyEnd         2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecScatterBegin       84 1.0 5.9300e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> KSPSetUp               4 1.0 1.4167e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> KSPSolve               1 1.0 4.0250e+00 1.0 4.31e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 91100  0  0  0  91100  0  0  0  1071
>>> PCSetUp                4 1.0 1.5207e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  1  0  0  0   3  1  0  0  0   156
>>> PCSetUpOnBlocks        1 1.0 1.1116e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  1  0  0  0   3  1  0  0  0   214
>>> PCApply               84 1.0 1.2912e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 29 28  0  0  0  29 28  0  0  0   924
>>> PCApplyOnBlocks      252 1.0 1.2909e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 29 28  0  0  0  29 28  0  0  0   924
>>> ------------------------------------------------------------------------------------------------------------------------
>>> 
>>> # I skipped the memory part - the options (and compiler options) are as 
>>> follows: 
>>> 
>>> #PETSc Option Table entries: 
>>> -ksp_type bcgs
>>> -ksp_view
>>> -log_view
>>> -pc_bjacobi_local_blocks 3
>>> -pc_factor_levels 0
>>> -pc_sub_type ilu
>>> -pc_type bjacobi
>>> #End of PETSc Option Table entries
>>> Compiled with FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 
>>> sizeof(PetscScalar) 16 sizeof(PetscInt) 4
>>> Configure options: --with-scalar-type=complex --download-mumps 
>>> --download-scalapack --with-fortran-kernels=1 --FOPTFLAGS="-O3 
>>> -fastsse -mp -openmp" --COPTFLAGS="-O3 -fastsse -mp -openmp" 
>>> --CXXOPTFLAGS="-O3 -fastsse -mp -openmp" --with-debugging=0
>>> -----------------------------------------
>>> Libraries compiled on 2020-02-03 10:44:31 on Haos-MBP
>>> Machine characteristics: Darwin-19.2.0-x86_64-i386-64bit
>>> Using PETSc directory: /Users/donghao/src/git/PETSc-current
>>> Using PETSc arch: arch-darwin-c-opt
>>> -----------------------------------------
>>> 
>>> Using C compiler: mpicc  -Wall -Wwrite-strings -Wno-strict-aliasing 
>>> -Wno-unknown-pragmas -fstack-protector -fno-stack-check 
>>> -Qunused-arguments -fvisibility=hidden
>>> Using Fortran compiler: mpif90  -Wall -ffree-line-length-0 
>>> -Wno-unused-dummy-argument
>>> 
>>> Using include paths: -I/Users/donghao/src/git/PETSc-current/include 
>>> -I/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/include
>>> -----------------------------------------
>>> 
>>> Using C linker: mpicc
>>> Using Fortran linker: mpif90
>>> Using libraries: 
>>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>>> -L/Users/donghao/src/git/PETSc-   current/arch-darwin-c-opt/lib -lpetsc 
>>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>>> -L/Users/    donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>>> -Wl,-rpath,/usr/local/opt/libevent/lib -L/usr/local/opt/libevent/    lib 
>>> -Wl,-rpath,/usr/local/Cellar/open-mpi/4.0.2/lib 
>>> -L/usr/local/Cellar/open-mpi/4.0.2/lib -Wl,-rpath,/usr/local/Cellar/ 
>>> gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-darwin19/9.2.0 
>>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-        
>>> darwin19/9.2.0 -Wl,-rpath,/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 
>>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 -lcmumps -   ldmumps -lsmumps 
>>> -lzmumps -lmumps_common -lpord -lscalapack -llapack -lblas -lc++ -ldl 
>>> -lmpi_usempif08 -                 lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi 
>>> -lgfortran -lquadmath -lm -lc++ -ldl
>>> 
>>> 
>>> On the other hand, running PETSc with 
>>> 
>>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -pc_sub_type lu -ksp_type gmres 
>>> -ksp_monitor -ksp_view -log_view
>>> 
>>> For the same problem, this takes 5.37s and 72 GMRES iterations. Our previous 
>>> tests show that BiCGstab (bcgs in PETSc) is almost always the most 
>>> effective KSP solver for our non-symmetric complex system. Strangely, the 
>>> solver is still using ilu instead of lu for the sub-blocks. The output is like: 
>> 
>>  -sub_pc_type lu
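>> 
>>   (i.e. the full option set for that run would be something like 
>>   -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type lu -ksp_type gmres 
>>   -ksp_monitor -ksp_view)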
>> 
>>> 
>>>   0 KSP Residual norm 2.480412407430e+02
>>>   1 KSP Residual norm 8.848059967835e+01
>>>   2 KSP Residual norm 3.415272863261e+01
>>>   3 KSP Residual norm 1.563045190939e+01
>>>   4 KSP Residual norm 6.241296940043e+00
>>>   5 KSP Residual norm 2.739710899854e+00
>>>   6 KSP Residual norm 1.391304148888e+00
>>>   7 KSP Residual norm 7.959262020849e-01
>>>   8 KSP Residual norm 4.828323055231e-01
>>>   9 KSP Residual norm 2.918529739200e-01
>>>  10 KSP Residual norm 1.905508589557e-01
>>>  11 KSP Residual norm 1.291541892702e-01
>>>  12 KSP Residual norm 8.827145774707e-02
>>>  13 KSP Residual norm 6.521331095889e-02
>>>  14 KSP Residual norm 5.095787952595e-02
>>>  15 KSP Residual norm 4.043060387395e-02
>>>  16 KSP Residual norm 3.232590200012e-02
>>>  17 KSP Residual norm 2.593944982216e-02
>>>  18 KSP Residual norm 2.064639483533e-02
>>>  19 KSP Residual norm 1.653916663492e-02
>>>  20 KSP Residual norm 1.334946415452e-02
>>>  21 KSP Residual norm 1.092886880597e-02
>>>  22 KSP Residual norm 8.988004105542e-03
>>>  23 KSP Residual norm 7.466501315240e-03
>>>  24 KSP Residual norm 6.284389135436e-03
>>>  25 KSP Residual norm 5.425231669964e-03
>>>  26 KSP Residual norm 4.766338253084e-03
>>>  27 KSP Residual norm 4.241238878242e-03
>>>  28 KSP Residual norm 3.808113525685e-03
>>>  29 KSP Residual norm 3.449383788116e-03
>>>  30 KSP Residual norm 3.126025526388e-03
>>>  31 KSP Residual norm 2.958328054299e-03
>>>  32 KSP Residual norm 2.802344900403e-03
>>>  33 KSP Residual norm 2.621993580492e-03
>>>  34 KSP Residual norm 2.430066269304e-03
>>>  35 KSP Residual norm 2.259043079597e-03
>>>  36 KSP Residual norm 2.104287972986e-03
>>>  37 KSP Residual norm 1.952916080045e-03
>>>  38 KSP Residual norm 1.804988937999e-03
>>>  39 KSP Residual norm 1.643302117377e-03
>>>  40 KSP Residual norm 1.471661332036e-03
>>>  41 KSP Residual norm 1.286445911163e-03
>>>  42 KSP Residual norm 1.127543025848e-03
>>>  43 KSP Residual norm 9.777148275484e-04
>>>  44 KSP Residual norm 8.293314450006e-04
>>>  45 KSP Residual norm 6.989331136622e-04
>>>  46 KSP Residual norm 5.852307780220e-04
>>>  47 KSP Residual norm 4.926715539762e-04
>>>  48 KSP Residual norm 4.215941372075e-04
>>>  49 KSP Residual norm 3.699489548162e-04
>>>  50 KSP Residual norm 3.293897163533e-04
>>>  51 KSP Residual norm 2.959954542998e-04
>>>  52 KSP Residual norm 2.700193032414e-04
>>>  53 KSP Residual norm 2.461789791204e-04
>>>  54 KSP Residual norm 2.218839085563e-04
>>>  55 KSP Residual norm 1.945154309976e-04
>>>  56 KSP Residual norm 1.661128781744e-04
>>>  57 KSP Residual norm 1.413198766258e-04
>>>  58 KSP Residual norm 1.213984003195e-04
>>>  59 KSP Residual norm 1.044317450754e-04
>>>  60 KSP Residual norm 8.919957502977e-05
>>>  61 KSP Residual norm 8.042584301275e-05
>>>  62 KSP Residual norm 7.292784493581e-05
>>>  63 KSP Residual norm 6.481935501872e-05
>>>  64 KSP Residual norm 5.718564652679e-05
>>>  65 KSP Residual norm 5.072589750116e-05
>>>  66 KSP Residual norm 4.487930741285e-05
>>>  67 KSP Residual norm 3.941040674119e-05
>>>  68 KSP Residual norm 3.492873281291e-05
>>>  69 KSP Residual norm 3.103798339845e-05
>>>  70 KSP Residual norm 2.822943237409e-05
>>>  71 KSP Residual norm 2.610615023776e-05
>>>  72 KSP Residual norm 2.441692671173e-05
>>> KSP Object: 1 MPI processes
>>>   type: gmres
>>>     restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization 
>>> with no iterative refinement
>>>     happy breakdown tolerance 1e-30
>>>   maximum iterations=150, nonzero initial guess
>>>   tolerances:  relative=1e-07, absolute=1e-50, divergence=10000.
>>>   left preconditioning
>>>   using PRECONDITIONED norm type for convergence test
>>> PC Object: 1 MPI processes
>>>   type: bjacobi
>>>     number of blocks = 3
>>>     Local solve is same for all blocks, in the following KSP and PC objects:
>>>     KSP Object: (sub_) 1 MPI processes
>>>       type: preonly
>>>       maximum iterations=10000, initial guess is zero
>>>       tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
>>>       left preconditioning
>>>       using NONE norm type for convergence test
>>>     PC Object: (sub_) 1 MPI processes
>>>       type: ilu
>>>         out-of-place factorization
>>>         0 levels of fill
>>>         tolerance for zero pivot 2.22045e-14
>>>         matrix ordering: natural
>>>         factor fill ratio given 1., needed 1.
>>>           Factored matrix follows:
>>>             Mat Object: 1 MPI processes
>>>               type: seqaij
>>>               rows=92969, cols=92969
>>>               package used to perform factorization: petsc
>>>               total: nonzeros=638417, allocated nonzeros=638417
>>>               total number of mallocs used during MatSetValues calls=0
>>>                 not using I-node routines
>>>       linear system matrix = precond matrix:
>>>       Mat Object: 1 MPI processes
>>>         type: seqaij
>>>         rows=92969, cols=92969
>>>         total: nonzeros=638417, allocated nonzeros=638417
>>>         total number of mallocs used during MatSetValues calls=0
>>>           not using I-node routines
>>>   linear system matrix = precond matrix:
>>>   Mat Object: 1 MPI processes
>>>     type: mpiaij
>>>     rows=278906, cols=278906
>>>     total: nonzeros=3274027, allocated nonzeros=3274027
>>>     total number of mallocs used during MatSetValues calls=0
>>>       not using I-node (on process 0) routines
>>> ...
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flop                         
>>>      --- Global ---  --- Stage ----  Total
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
>>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>> 
>>> --- Event Stage 0: Main Stage
>>> 
>>> BuildTwoSidedF         1 1.0 2.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatMult               75 1.0 1.5812e+00 1.0 1.88e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 28 24  0  0  0  28 24  0  0  0  1189
>>> MatSolve             228 1.0 1.1442e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 20 14  0  0  0  20 14  0  0  0   944
>> 
>>   These flop rates are ok, but not great. Perhaps an older machine.
>> 
>>> MatLUFactorNum         3 1.0 8.1930e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0   290
>>> MatILUFactorSym        3 1.0 2.7102e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatAssemblyBegin       5 1.0 3.7000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatAssemblyEnd         5 1.0 3.1895e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatGetRowIJ            3 1.0 2.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatCreateSubMats       1 1.0 4.0904e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatGetOrdering         3 1.0 4.2640e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatView                3 1.0 1.4400e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecMDot               72 1.0 1.1984e+00 1.0 2.25e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 21 28  0  0  0  21 28  0  0  0  1877
>> 
>>    21 percent of the time in VecMDot is huge for a sequential run. I 
>> think maybe you are using a terrible OpenMP BLAS? 
>> 
>>    Send configure.log 
>> 
>> 
>>> VecNorm               76 1.0 1.6841e-01 1.0 1.70e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  2  0  0  0   3  2  0  0  0  1007
>>> VecScale              75 1.0 1.8241e-02 1.0 8.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  1  0  0  0   0  1  0  0  0  4587
>>> VecCopy                3 1.0 1.4970e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecSet               276 1.0 9.1970e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>>> VecAXPY                6 1.0 3.7450e-03 1.0 1.34e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  3575
>>> VecMAXPY              75 1.0 1.0022e+00 1.0 2.41e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 18 30  0  0  0  18 30  0  0  0  2405
>>> VecAssemblyBegin       2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecAssemblyEnd         2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecScatterBegin       76 1.0 5.5100e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecNormalize          75 1.0 1.8462e-01 1.0 2.51e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  3  0  0  0   3  3  0  0  0  1360
>>> KSPSetUp               4 1.0 1.1341e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> KSPSolve               1 1.0 5.3123e+00 1.0 7.91e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 93100  0  0  0  93100  0  0  0  1489
>>> KSPGMRESOrthog        72 1.0 2.1316e+00 1.0 4.50e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 37 57  0  0  0  37 57  0  0  0  2110
>>> PCSetUp                4 1.0 1.5531e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  0  0  0  0   3  0  0  0  0   153
>>> PCSetUpOnBlocks        1 1.0 1.1343e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  2  0  0  0  0   2  0  0  0  0   209
>>> PCApply               76 1.0 1.1671e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 20 14  0  0  0  20 14  0  0  0   925
>>> PCApplyOnBlocks      228 1.0 1.1668e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 20 14  0  0  0  20 14  0  0  0   925
>>> ------------------------------------------------------------------------------------------------------------------------
>>> ...
>>> #PETSc Option Table entries:
>>> -ksp_monitor
>>> -ksp_type gmres
>>> -ksp_view
>>> -log_view
>>> -pc_bjacobi_local_blocks 3
>>> -pc_sub_type lu
>>> -pc_type bjacobi
>>> #End of PETSc Option Table entries
>>> ...
>>> 
>>> Does any of the setup/output ring a bell? 
>>> 
>>> BTW, out of curiosity - what is an "I-node" routine?
>>> 
>>> 
>>> Cheers, 
>>> Hao
>>> 
>>> 
>>> From: Smith, Barry F. <[email protected]>
>>> Sent: Wednesday, February 5, 2020 9:42 PM
>>> To: Hao DONG <[email protected]>
>>> Cc: [email protected] <[email protected]>
>>> Subject: Re: [petsc-users] What is the right way to implement a (block) 
>>> Diagonal ILU as PC?
>>> 
>>> 
>>> 
>>>> On Feb 5, 2020, at 4:36 AM, Hao DONG <[email protected]> wrote:
>>>> 
>>>> Thanks a lot for your suggestions, Hong and Barry - 
>>>> 
>>>> As you suggested, I tried out the LU direct solvers (built-in and MUMPS) 
>>>> first thing this morning, and they work perfectly, albeit slowly. As my 
>>>> problem is part of a PDE-based optimization, an exact solution during the 
>>>> search procedure is not necessary (I often set a relative tolerance of 
>>>> 1E-7/1E-8, etc. for Krylov subspace methods). I also tried BJACOBI with 
>>>> exact LU; the KSP converges in one or two iterations, as expected. 
>>>> 
>>>> I added the -ksp_view option for the D-ILU test (still with block Jacobi as 
>>>> the preconditioner and bcgs as the KSP solver). The KSPView output from one 
>>>> of the examples in a toy model looks like: 
>>>> 
>>>> KSP Object: 1 MPI processes
>>>>   type: bcgs
>>>>   maximum iterations=120, nonzero initial guess
>>>>   tolerances:  relative=1e-07, absolute=1e-50, divergence=10000.
>>>>   left preconditioning
>>>>   using PRECONDITIONED norm type for convergence test
>>>> PC Object: 1 MPI processes
>>>>   type: bjacobi
>>>>     number of blocks = 3
>>>>     Local solve is same for all blocks, in the following KSP and PC 
>>>> objects:
>>>>     KSP Object: (sub_) 1 MPI processes
>>>>       type: preonly
>>>>       maximum iterations=10000, initial guess is zero
>>>>       tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
>>>>       left preconditioning
>>>>       using NONE norm type for convergence test
>>>>     PC Object: (sub_) 1 MPI processes
>>>>       type: ilu
>>>>         out-of-place factorization
>>>>         0 levels of fill
>>>>         tolerance for zero pivot 2.22045e-14
>>>>         matrix ordering: natural
>>>>         factor fill ratio given 1., needed 1.
>>>>           Factored matrix follows:
>>>>             Mat Object: 1 MPI processes
>>>>               type: seqaij
>>>>               rows=11294, cols=11294
>>>>               package used to perform factorization: petsc
>>>>               total: nonzeros=76008, allocated nonzeros=76008
>>>>               total number of mallocs used during MatSetValues calls=0
>>>>                 not using I-node routines
>>>>       linear system matrix = precond matrix:
>>>>       Mat Object: 1 MPI processes
>>>>         type: seqaij
>>>>         rows=11294, cols=11294
>>>>         total: nonzeros=76008, allocated nonzeros=76008
>>>>         total number of mallocs used during MatSetValues calls=0
>>>>           not using I-node routines
>>>>   linear system matrix = precond matrix:
>>>>   Mat Object: 1 MPI processes
>>>>     type: mpiaij
>>>>     rows=33880, cols=33880
>>>>     total: nonzeros=436968, allocated nonzeros=436968
>>>>     total number of mallocs used during MatSetValues calls=0
>>>>       not using I-node (on process 0) routines
>>>> 
>>>> Do you see anything wrong with my setup?
>>>> 
>>>> I also tried a quick performance test with a small 278906 by 278906 matrix 
>>>> (3850990 nnz) with the following parameters: 
>>>> 
>>>> -ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 -pc_sub_type 
>>>> ilu -ksp_view
>>>> 
>>>> Reducing the relative residual to 1E-7 
>>>> 
>>>> Took 4.08s with 41 bcgs iterations. 
>>>> 
>>>> Merely changing the -pc_bjacobi_local_blocks to 6 
>>>> 
>>>> Took 7.02s with 73 bcgs iterations. 9 blocks would further take 9.45s with 
>>>> 101 bcgs iterations.
>>> 
>>>    This is normal: more blocks, slower convergence.
>>>> 
>>>> As a reference, my home-brew Fortran code solves the same problem (3-block 
>>>> D-ILU0) in 
>>>> 
>>>> 1.84s with 24 bcgs iterations (the bcgs code is also a home-brew one)…
>>>> 
>>>    Run the PETSc code with optimization (./configure --with-debugging=0) and 
>>> run the code with -log_view; this will show where the PETSc code is spending 
>>> the time (send it to us).
>>> 
>>> 
>>>> 
>>>> 
>>>> Well, by "use an explicit L/U matrix as preconditioner", I wonder whether a 
>>>> user is allowed to provide his own (separate) L and U Mat for 
>>>> preconditioning (like how it works in the Matlab solvers), e.g.
>>>> 
>>>> x = qmr(A,b,Tol,MaxIter,L,U,x)
>>>> 
>>>> As I have already explicitly constructed the L and U matrices in Fortran, it 
>>>> is not hard to convert them to Mat in PETSc to test PETSc and my Fortran 
>>>> code head-to-head. In that case the A, b, x, and L/U would all be identical, 
>>>> so it would be easier to see where the problem comes from. 
>>>> 
>>>> 
>>>     No, we don't provide this kind of support
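>>> 
>>>    (If you really do want to wire your own factors in for a head-to-head 
>>> test, the generic mechanism is PCSHELL. An untested sketch, with the apply 
>>> routine left as a placeholder for your own D-ILU triangular solves: 
>>> 
>>>       external MyDILUApply
>>>       call KSPGetPC(ksp_local,pc_local,ierr)
>>>       call PCSetType(pc_local,PCSHELL,ierr)
>>>       call PCShellSetApply(pc_local,MyDILUApply,ierr)
>>> 
>>>       ! the callback PETSc will invoke to apply the preconditioner:
>>>       subroutine MyDILUApply(pc,x,y,ierr)
>>>       PC             pc
>>>       Vec            x,y
>>>       PetscErrorCode ierr
>>>       ! compute y ~= A^{-1} x here using your own L/U factors
>>>       end subroutine
>>> 
>>>    This is only worth doing for plugging in externally built factors; for 
>>> factors built by PETSc itself the bjacobi/ilu route already covers it.)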
>>> 
>>> 
>>>> 
>>>> BTW, there is another thing I wondered about - is there a way to output the 
>>>> residual in the unpreconditioned norm? I tried
>>>> 
>>>> call KSPSetNormType(ksp_local, KSP_NORM_UNPRECONDITIONED, ierr)
>>>> 
>>>> but I always get an error that the current KSP does not support the 
>>>> unpreconditioned norm with LEFT/RIGHT preconditioning (either way). Is it 
>>>> possible to output the unpreconditioned residual in PETSc at all?
>>> 
>>>   -ksp_monitor_true_residual.   You can also run GMRES (and some other 
>>> methods) with right preconditioning, -ksp_pc_side right; then the residual 
>>> computed by the algorithm is the unpreconditioned residual.
>>> 
>>>   KSPSetNormType sets the norm used in the algorithm; it generally always 
>>> has to be left or right, and only a couple of algorithms support the 
>>> unpreconditioned norm directly.
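>>> 
>>>   (For example - standard options and calls, but untested here: 
>>> 
>>>   -ksp_type gmres -ksp_pc_side right -ksp_monitor_true_residual 
>>> 
>>>   or, in the code, 
>>> 
>>>       call KSPSetPCSide(ksp_local,PC_RIGHT,ierr) 
>>> 
>>>   -ksp_monitor_true_residual prints both the preconditioned and the true 
>>> (unpreconditioned) residual norms at each iteration.)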
>>> 
>>>   Barry
>>> 
>>> 
>>>> 
>>>> Cheers, 
>>>> Hao
>>>> 
>>>> 
>>>> From: Smith, Barry F. <[email protected]>
>>>> Sent: Tuesday, February 4, 2020 8:27 PM
>>>> To: Hao DONG <[email protected]>
>>>> Cc: [email protected] <[email protected]>
>>>> Subject: Re: [petsc-users] What is the right way to implement a (block) 
>>>> Diagonal ILU as PC?
>>>> 
>>>> 
>>>> 
>>>>> On Feb 4, 2020, at 12:41 PM, Hao DONG <[email protected]> wrote:
>>>>> 
>>>>> Dear all, 
>>>>> 
>>>>> 
>>>>> I have a few questions about the implementation of a diagonal ILU PC in 
>>>>> PETSc. I want to solve a very simple system with KSP (in parallel); the 
>>>>> nature of the system (finite-difference time-harmonic Maxwell) is 
>>>>> probably not important to the question itself. Long story short, I just 
>>>>> need to solve a set of equations Ax = b with a block-diagonal system 
>>>>> matrix, like (not sure if the mono font works): 
>>>>> 
>>>>>   |X    |  
>>>>> A =|  Y  |  
>>>>>   |    Z| 
>>>>> 
>>>>> Note that A is not really block-diagonal; it's just a multi-diagonal 
>>>>> matrix determined by the FD mesh, where most elements are close to the 
>>>>> diagonal. So instead of a full ILU decomposition, a D-ILU is a good 
>>>>> approximation as a preconditioner, and the number of blocks should not 
>>>>> matter too much: 
>>>>> 
>>>>>    |Lx      |         |Ux      |
>>>>> L = |   Ly   | and U = |   Uy   |
>>>>>    |      Lz|         |      Uz|
>>>>> 
>>>>> Where [Lx, Ux] = ILU0(X), etc. Indeed, the D-ILU preconditioner (with 3N 
>>>>> blocks) is quite efficient with Krylov subspace methods like BiCGstab or 
>>>>> QMR in my sequential Fortran/Matlab code. 
>>>>> 
>>>>> So like most users, I am looking for a parallel implementation of this 
>>>>> in PETSc. After looking through the manual, it seems that the 
>>>>> most straightforward way to do it is through PCBJACOBI. Not sure I 
>>>>> understand it right: I just set up a 3-block PCBJACOBI and gave each 
>>>>> block a KSP with PCILU. Is this supposed to be equivalent to my D-ILU 
>>>>> preconditioner? My little block of Fortran code looks like: 
>>>>> ...
>>>>>      call PCBJacobiSetTotalBlocks(pc_local,Ntotal,                   &
>>>>>     &     isubs,ierr)
>>>>>      call PCBJacobiSetLocalBlocks(pc_local, Nsub,                    &
>>>>>     &    isubs(istart:iend),ierr)
>>>>>      ! set up the block jacobi structure
>>>>>      call KSPSetup(ksp_local,ierr)
>>>>>      ! allocate sub ksps
>>>>>      allocate(ksp_sub(Nsub))
>>>>>      call PCBJacobiGetSubKSP(pc_local,Nsub,istart,                   &
>>>>>     &     ksp_sub,ierr)
>>>>>      do i=1,Nsub
>>>>>          call KSPGetPC(ksp_sub(i),pc_sub,ierr)
>>>>>          !ILU preconditioner
>>>>>          call PCSetType(pc_sub,ptype,ierr)
>>>>>          call PCFactorSetLevels(pc_sub,1,ierr) ! use ILU(1) here
>>>>>          call KSPSetType(ksp_sub(i),KSPPREONLY,ierr)
>>>>>      end do
>>>>>      call KSPSetTolerances(ksp_local,KSSiter%tol,PETSC_DEFAULT_REAL, &
>>>>>     &     PETSC_DEFAULT_REAL,KSSiter%maxit,ierr)
>>>>> … 
>>>> 
>>>>     This code looks essentially right. You should run with -ksp_view to 
>>>> see exactly what PETSc is using for the solver. 
>>>> 
>>>>> 
>>>>> I understand that the parallel performance may not be comparable, so I 
>>>>> first set up a one-process test (with MPIAIJ, but all the rows are local 
>>>>> since there is only one process). The system is solved without any 
>>>>> problem (identical results within error). But the performance is actually 
>>>>> a lot worse (the code was built without debugging flags for the 
>>>>> performance tests) than my own home-brew implementation in Fortran (I 
>>>>> wrote my own ILU0 in CSR sparse matrix format), which is hard to believe. 
>>>>> I suspect the difference comes from the PC, as the PETSc version took many 
>>>>> more BiCGstab iterations (60-ish vs 100-ish) to converge to the same 
>>>>> relative tolerance. 
>>>> 
>>>>   PETSc uses GMRES by default with a restart of 30 and left 
>>>> preconditioning. Different exact choices in the solver (which 
>>>> is why -ksp_view is so useful) can explain the differences in the runs 
>>>> between your code and PETSc's.
>>>>> 
>>>>> This is further confirmed when I change the setup of the D-ILU (using 6 or 
>>>>> 9 blocks instead of 3). While my Fortran/Matlab codes see minimal 
>>>>> performance difference (<5%) when I play with the D-ILU setup, increasing 
>>>>> the number of D-ILU blocks from 3 to 9 caused the KSP setup with 
>>>>> PCBJACOBI to suffer a performance decrease of more than 50% in the 
>>>>> sequential test.
>>>> 
>>>>   This is odd: the more blocks, the smaller each block, so the quicker the 
>>>> ILU setup should be. You can run various cases with -log_view and send us 
>>>> the output to see how much time is spent in each part of the computation.
>>>> 
>>>>> So my implementation IS somewhat different in PETSc. Am I missing something 
>>>>> in the PCBJACOBI setup? Or do I have some fundamental misunderstanding of 
>>>>> how PCBJACOBI works in PETSc? 
>>>> 
>>>>   Probably not.
>>>>> 
>>>>> If this is not the right way to implement a block-diagonal ILU as a 
>>>>> (parallel) PC, please kindly point me in the right direction. I searched 
>>>>> through the mailing list for answers, only to find a couple of 
>>>>> similar questions... An example would be nice.
>>>> 
>>>>   Your approach seems fundamentally right, but I cannot rule out possible 
>>>> bugs.
>>>>> 
>>>>> On the other hand, does PETSc support a simple way to use an explicit L/U 
>>>>> matrix as a preconditioner? I could import the D-ILU matrix (I have already 
>>>>> converted my A matrix into a Mat) constructed in my Fortran code to make a 
>>>>> better comparison. Or do I have to construct my own PC using PCSHELL? If 
>>>>> so, is there a good tutorial/example on how to use PCSHELL 
>>>>> (in a more advanced sense, like how to set up the PC side and transpose)? 
>>>> 
>>>>   Not sure what you mean by an explicit L/U matrix as a preconditioner. As 
>>>> Hong said, yes, you can use a parallel LU from MUMPS or SuperLU_DIST or 
>>>> PaStiX as the solver. You do not need any shell code. You simply need to 
>>>> set the PCType to lu. 
>>>> 
>>>>   You can also set all of these options from the command line and don't 
>>>> need to hardwire them in the code. So call KSPSetFromOptions() and then, 
>>>> for example:
>>>> 
>>>>    -pc_type bjacobi  -pc_bjacobi_local_blocks 3 -pc_sub_type ilu (this 
>>>> last one is applied to each block so you could use -pc_type lu and it 
>>>> would use lu on each block.) 
>>>> 
>>>>   -ksp_type preonly  -pc_type lu -pc_factor_mat_solver_type mumps  (do 
>>>> parallel LU with MUMPS)
>>>> 
>>>> By not hardwiring these in the code and just using options, you can test 
>>>> out different cases much more quickly.
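>>>> 
>>>> For instance (a sketch only, assuming ksp_local is your KSP object and b, x 
>>>> are your Vecs): 
>>>> 
>>>>       call KSPSetFromOptions(ksp_local,ierr)
>>>>       call KSPSolve(ksp_local,b,x,ierr)
>>>> 
>>>> and the KSP type, PC type, number of blocks, sub-solvers, etc. are then all 
>>>> picked up from the command-line options at run time.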
>>>> 
>>>> Use -ksp_view to make sure it is using the solver the way you expect.
>>>> 
>>>> Barry
>>>> 
>>>> 
>>>> 
>>>>   Barry
>>>> 
>>>>> 
>>>>> Thanks in advance, 
>>>>> 
>>>>> Hao
>> 
>> <configure.log><configure.log>
> 
