Hi Barry, 

Thank you for your suggestions (and insights)! Indeed, my initial motivation to 
try out PETSc was the variety of methods it offers. As my matrix pattern is 
relatively simple (the 3D time-harmonic Maxwell equations discretized with a 
staggered-grid finite difference scheme), and considering that I do not have 
the resources for the direct solvers, I was looking for a fast Krylov subspace 
method / preconditioner that scales well with, say, tens of CPU cores. 

As a simple block-Jacobi preconditioner seems to lose its efficiency with 
more than a handful of blocks, I plan to look into other 
methods/preconditioners, e.g. multigrid (as a preconditioner) or domain 
decomposition methods (see the P.S. below for what I plan to try first). But I 
will probably need to go through a fair amount of literature before laying my 
hands on those (or bothering you with more questions!). Anyway, thanks again 
for your kind help. 


All the best, 
Hao
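
P.S. For the record, I will probably start by experimenting purely from the 
command line before writing any new code - something like the options below 
(rough starting points only; I still need to check the details against the 
manual): 

-pc_type asm -pc_asm_overlap 1 -sub_pc_type ilu -ksp_type bcgs -ksp_monitor -log_view
-pc_type gamg -ksp_type gmres -ksp_monitor -log_view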

> On Feb 8, 2020, at 8:02 AM, Smith, Barry F. <[email protected]> wrote:
> 
> 
> 
>> On Feb 7, 2020, at 7:44 AM, Hao DONG <[email protected]> wrote:
>> 
>> Thanks, Barry, I really appreciate your help - 
>> 
>> I removed the OpenMP flags and rebuilt PETSc. So the problem is from the 
>> BLAS lib I linked?
> 
>  Yes, the OpenMP causes it to run in parallel, but the problem is not big 
> enough and the machine is not good enough for parallel BLAS to speed things 
> up; instead it slows things down a lot. We see this often: parallel BLAS must 
> be used with care.
> 
>> Not sure which version my BLAS is, though… But I also included the 
>> --download-scalapack option. Shouldn't that enable linking with PBLAS 
>> automatically?
>> 
>> After looking at the bcgs code in PETSc, I suppose the residual is indeed 
>> recorded twice per "actual iteration", which would explain the difference in 
>> iteration counts. 
>> 
>> My laptop is indeed an old machine (MBP15 mid-2014). I just cannot work with 
>> vi without a physical ESC key...
> 
>   The latest has a physical ESC; I am stuck without one for a couple more 
> years.
> 
>> I have attached the configure.log - I didn't know it was so large! 
>> 
>> Anyway, it seems that the removal of -openmp changes quite a lot of things, 
>> and the performance is indeed much better - the flop/sec increases by a 
>> factor of 3. Still, I am seeing 20 percent of the time in VecMDot, but no 
>> VecMDot in BCGS at all (see output below) - is that a feature of the GMRES 
>> method? 
> 
>   Yes, GMRES orthogonalizes against the last restart directions, which uses 
> these routines, while BCGS does not; this is why BCGS is cheaper per 
> iteration.
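> 
>   (Side note: the per-iteration orthogonalization cost of GMRES can be varied 
> from the command line if you want to experiment; these are standard options, 
> listed only as an illustration: 
> 
>   -ksp_gmres_restart 10              shorter restart, cheaper orthogonalization 
>   -ksp_gmres_modifiedgramschmidt     modified instead of classical Gram-Schmidt 
> 
>   A shorter restart usually costs extra iterations, so it is a trade-off.)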
> 
>    PETSc is no faster than your code because the algorithm is the same, the 
> compilers are the same, and the hardware is the same; there are no clever 
> tricks that would make PETSc much faster. What PETSc provides is a huge 
> variety of tested algorithms that no single person could code on their own. 
> Anything in PETSc you could code yourself, given endless time, and get 
> basically the same performance.
> 
>  Barry
> 
> 
>> 
>> here is the output of the same problem with:  
>> 
>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu -ksp_type gmres 
>> -ksp_monitor -ksp_view
>> 
>> 
>> ---------------------------------------------- PETSc Performance Summary: 
>> ----------------------------------------------
>> 
>> Mod3DMT.test on a arch-darwin-c-opt named Haos-MBP with 1 processor, by 
>> donghao Fri Feb  7 10:26:19 2020
>> Using Petsc Release Version 3.12.3, unknown
>> 
>>                          Max       Max/Min     Avg       Total
>> Time (sec):           2.520e+00     1.000   2.520e+00
>> Objects:              1.756e+03     1.000   1.756e+03
>> Flop:                 7.910e+09     1.000   7.910e+09  7.910e+09
>> Flop/sec:             3.138e+09     1.000   3.138e+09  3.138e+09
>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Reductions:       0.000e+00     0.000
>> 
>> Flop counting convention: 1 flop = 1 real number operation of type 
>> (multiply/divide/add/subtract)
>>                             e.g., VecAXPY() for real vectors of length N --> 
>> 2N flop
>>                             and VecAXPY() for complex vectors of length N 
>> --> 8N flop
>> 
>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  
>> -- Message Lengths --  -- Reductions --
>>                         Avg     %Total     Avg     %Total    Count   %Total  
>>    Avg         %Total    Count   %Total
>>  0:      Main Stage: 2.5204e+00 100.0%  7.9096e+09 100.0%  0.000e+00   0.0%  
>> 0.000e+00        0.0%  0.000e+00   0.0%
>> 
>> ------------------------------------------------------------------------------------------------------------------------
>> …
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                          
>>     --- Global ---  --- Stage ----  Total
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>> ------------------------------------------------------------------------------------------------------------------------
>> 
>> --- Event Stage 0: Main Stage
>> 
>> BuildTwoSidedF         1 1.0 3.4000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatMult               75 1.0 6.2884e-01 1.0 1.88e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 25 24  0  0  0  25 24  0  0  0  2991
>> MatSolve             228 1.0 4.4164e-01 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 18 14  0  0  0  18 14  0  0  0  2445
>> MatLUFactorNum         3 1.0 4.1317e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  2  0  0  0  0   2  0  0  0  0   574
>> MatILUFactorSym        3 1.0 2.3858e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatAssemblyBegin       5 1.0 4.4000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatAssemblyEnd         5 1.0 1.5067e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatGetRowIJ            3 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatCreateSubMats       1 1.0 2.4558e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatGetOrdering         3 1.0 1.3290e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatView                3 1.0 1.2800e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecMDot               72 1.0 4.9875e-01 1.0 2.25e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 20 28  0  0  0  20 28  0  0  0  4509
>> VecNorm               76 1.0 6.6666e-02 1.0 1.70e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  3  2  0  0  0   3  2  0  0  0  2544
>> VecScale              75 1.0 1.7982e-02 1.0 8.37e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  1  0  0  0   1  1  0  0  0  4653
>> VecCopy                3 1.0 1.5080e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecSet               276 1.0 9.6784e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
>> VecAXPY                6 1.0 3.6860e-03 1.0 1.34e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  3632
>> VecMAXPY              75 1.0 4.0490e-01 1.0 2.41e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 16 30  0  0  0  16 30  0  0  0  5951
>> VecAssemblyBegin       2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecAssemblyEnd         2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecScatterBegin       76 1.0 5.3800e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecNormalize          75 1.0 8.3690e-02 1.0 2.51e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  3  3  0  0  0   3  3  0  0  0  2999
>> KSPSetUp               4 1.0 1.1663e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> KSPSolve               1 1.0 2.2119e+00 1.0 7.91e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 88100  0  0  0  88100  0  0  0  3576
>> KSPGMRESOrthog        72 1.0 8.7843e-01 1.0 4.50e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 35 57  0  0  0  35 57  0  0  0  5121
>> PCSetUp                4 1.0 9.2448e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  4  0  0  0  0   4  0  0  0  0   257
>> PCSetUpOnBlocks        1 1.0 6.6597e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  3  0  0  0  0   3  0  0  0  0   356
>> PCApply               76 1.0 4.6281e-01 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 18 14  0  0  0  18 14  0  0  0  2333
>> PCApplyOnBlocks      228 1.0 4.6262e-01 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 18 14  0  0  0  18 14  0  0  0  2334
>> ------------------------------------------------------------------------------------------------------------------------
>> 
>> Average time to get PetscTime(): 1e-07
>> #PETSc Option Table entries:
>> -I LBFGS
>> -ksp_type gmres
>> -ksp_view
>> -log_view
>> -pc_bjacobi_local_blocks 3
>> -pc_type bjacobi
>> -sub_pc_type ilu
>> #End of PETSc Option Table entries
>> Compiled with FORTRAN kernels
>> Compiled with full precision matrices (default)
>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 
>> sizeof(PetscScalar) 16 sizeof(PetscInt) 4
>> Configure options: --with-scalar-type=complex --download-mumps 
>> --download-scalapack --with-fortran-kernels=1 --FOPTFLAGS="-O3 
>> -ffree-line-length-0 -msse2" --COPTFLAGS="-O3 -msse2" --CXXOPTFLAGS="-O3 
>> -msse2" --with-debugging=0
>> -----------------------------------------
>> Libraries compiled on 2020-02-07 10:07:42 on Haos-MBP
>> Machine characteristics: Darwin-19.3.0-x86_64-i386-64bit
>> Using PETSc directory: /Users/donghao/src/git/PETSc-current
>> Using PETSc arch: arch-darwin-c-opt
>> -----------------------------------------
>> 
>> Using C compiler: mpicc  -Wall -Wwrite-strings -Wno-strict-aliasing 
>> -Wno-unknown-pragmas -fstack-protector -fno-stack-check 
>> -Qunused-arguments -fvisibility=hidden -O3 -msse2
>> Using Fortran compiler: mpif90  -Wall -ffree-line-length-0 
>> -Wno-unused-dummy-argument -O3 -ffree-line-length-0 -msse2
>> -----------------------------------------
>> 
>> Using include paths: -I/Users/donghao/src/git/PETSc-current/include 
>> -I/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/include
>> -----------------------------------------
>> 
>> Using C linker: mpicc
>> Using Fortran linker: mpif90
>> Using libraries: 
>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>> -L/Users/donghao/src/git/PETSc-   current/arch-darwin-c-opt/lib -lpetsc 
>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>> -L/Users/    donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>> -Wl,-rpath,/usr/local/opt/libevent/lib -L/usr/local/opt/libevent/    lib 
>> -Wl,-rpath,/usr/local/Cellar/open-mpi/4.0.2/lib 
>> -L/usr/local/Cellar/open-mpi/4.0.2/lib -Wl,-rpath,/usr/local/Cellar/ 
>> gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-darwin19/9.2.0 
>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-        
>> darwin19/9.2.0 -Wl,-rpath,/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 
>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 -lcmumps -   ldmumps -lsmumps 
>> -lzmumps -lmumps_common -lpord -lscalapack -llapack -lblas -lc++ -ldl 
>> -lmpi_usempif08 -                 lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi 
>> -lgfortran -lquadmath -lm -lc++ -ldl
>> -----------------------------------------
>> 
>> 
>> 
>> The BCGS solver performance is now comparable to my own Fortran code 
>> (1.84s). Still, I feel that there is something wrong hidden somewhere in my 
>> setup - a professional library should perform better, I believe. Any other 
>> ideas that I can look into? Interestingly, there is no VecMDot operation at 
>> all! Here is the output with the options: 
>> 
>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu -ksp_type bcgs 
>> -ksp_monitor -ksp_view
>> 
>> 
>> 
>> 
>> ---------------------------------------------- PETSc Performance Summary: 
>> ----------------------------------------------
>> 
>> Mod3DMT.test on a arch-darwin-c-opt named Haos-MBP with 1 processor, by 
>> donghao Fri Feb  7 10:38:00 2020
>> Using Petsc Release Version 3.12.3, unknown
>> 
>>                          Max       Max/Min     Avg       Total
>> Time (sec):           2.187e+00     1.000   2.187e+00
>> Objects:              1.155e+03     1.000   1.155e+03
>> Flop:                 4.311e+09     1.000   4.311e+09  4.311e+09
>> Flop/sec:             1.971e+09     1.000   1.971e+09  1.971e+09
>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Reductions:       0.000e+00     0.000
>> 
>> Flop counting convention: 1 flop = 1 real number operation of type 
>> (multiply/divide/add/subtract)
>>                             e.g., VecAXPY() for real vectors of length N --> 
>> 2N flop
>>                             and VecAXPY() for complex vectors of length N 
>> --> 8N flop
>> 
>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  
>> -- Message Lengths --  -- Reductions --
>>                         Avg     %Total     Avg     %Total    Count   %Total  
>>    Avg         %Total    Count   %Total
>>  0:      Main Stage: 2.1870e+00 100.0%  4.3113e+09 100.0%  0.000e+00   0.0%  
>> 0.000e+00        0.0%  0.000e+00   0.0%
>> 
>> ------------------------------------------------------------------------------------------------------------------------
>> 
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                          
>>     --- Global ---  --- Stage ----  Total
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>> ------------------------------------------------------------------------------------------------------------------------
>> 
>> --- Event Stage 0: Main Stage
>> 
>> BuildTwoSidedF         1 1.0 2.2000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatMult               83 1.0 7.8726e-01 1.0 2.08e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 36 48  0  0  0  36 48  0  0  0  2644
>> MatSolve             252 1.0 5.5656e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 25 28  0  0  0  25 28  0  0  0  2144
>> MatLUFactorNum         3 1.0 4.5115e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  2  1  0  0  0   2  1  0  0  0   526
>> MatILUFactorSym        3 1.0 2.5103e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatAssemblyBegin       5 1.0 3.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatAssemblyEnd         5 1.0 1.5709e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatGetRowIJ            3 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatCreateSubMats       1 1.0 2.8989e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> MatGetOrdering         3 1.0 1.1200e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> MatView                3 1.0 1.2600e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecDot                82 1.0 8.9328e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  4  4  0  0  0   4  4  0  0  0  2048
>> VecDotNorm2           41 1.0 9.9019e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  5  4  0  0  0   5  4  0  0  0  1848
>> VecNorm               43 1.0 3.9988e-02 1.0 9.59e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  2  2  0  0  0   2  2  0  0  0  2399
>> VecCopy                2 1.0 1.1150e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecSet               271 1.0 4.2833e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>> VecAXPY                1 1.0 5.9200e-04 1.0 2.23e+06 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  3769
>> VecAXPBYCZ            82 1.0 1.1448e-01 1.0 3.66e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  5  8  0  0  0   5  8  0  0  0  3196
>> VecWAXPY              82 1.0 6.7460e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  3  4  0  0  0   3  4  0  0  0  2712
>> VecAssemblyBegin       2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecAssemblyEnd         2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> VecScatterBegin       84 1.0 5.2800e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>> KSPSetUp               4 1.0 1.4765e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>> KSPSolve               1 1.0 1.8514e+00 1.0 4.31e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 85100  0  0  0  85100  0  0  0  2329
>> PCSetUp                4 1.0 1.0193e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  5  1  0  0  0   5  1  0  0  0   233
>> PCSetUpOnBlocks        1 1.0 7.1421e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>> 0.0e+00  3  1  0  0  0   3  1  0  0  0   332
>> PCApply               84 1.0 5.7927e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 26 28  0  0  0  26 28  0  0  0  2060
>> PCApplyOnBlocks      252 1.0 5.7902e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>> 0.0e+00 26 28  0  0  0  26 28  0  0  0  2061
>> ------------------------------------------------------------------------------------------------------------------------
>> 
>> 
>> Cheers, 
>> Hao
>> 
>> 
>> 
>> From: Smith, Barry F. <[email protected]>
>> Sent: Thursday, February 6, 2020 7:03 PM
>> To: Hao DONG <[email protected]>
>> Cc: [email protected] <[email protected]>
>> Subject: Re: [petsc-users] What is the right way to implement a (block) 
>> Diagonal ILU as PC?
>> 
>> 
>>  Read my comments ALL the way down, they go a long way.
>> 
>>> On Feb 6, 2020, at 3:43 AM, Hao DONG <[email protected]> wrote:
>>> 
>>> Dear Hong and Barry, 
>>> 
>>> Thanks for the suggestions. So there could be some problem in my PETSc 
>>> configuration? My PETSc lib was indeed compiled without the debug 
>>> flags (--with-debugging=0). I use GCC/GFortran (Homebrew GCC 9.2.0) to 
>>> compile and build both PETSc and my own Fortran code. My Fortran 
>>> compile flags are simply: 
>>> 
>>> -O3 -ffree-line-length-none -fastsse 
>>> 
>>> which are also used for -FOPTFLAGS in PETSc (I added -openmp for PETSc, but 
>>> not for my Fortran code, as I don't have any OpenMP optimizations in my 
>>> program). Note that the performance test results I listed yesterday (e.g. 
>>> 4.08s with 41 bcgs iterations) do not include any CSR-array-to-PETSc 
>>> translation overhead (they only cover the set and solve parts). 
>> 
>>   PETSc doesn't use -openmp in any way for its solvers. Do not use this 
>> option; it may be slowing the code down. Please send configure.log
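>> 
>>   (If the linked BLAS turns out to be a threaded vendor one, one way to rule 
>> it out completely is to let PETSc build its own reference BLAS/LAPACK - a 
>> sketch only, to be adapted to your existing configure line: 
>> 
>>   ./configure --with-scalar-type=complex --download-mumps --download-scalapack \
>>               --download-fblaslapack --with-debugging=0 
>> 
>>   The reference BLAS is not fast, but it is purely sequential and predictable.)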
>> 
>>> 
>>> I have two questions about the performance difference - 
>>> 
>>> 1. Is the ILU factorization only done once, or is it performed at 
>>> every outer KSP iteration step? The latter sounds unlikely - but if so, it 
>>> could cause some extra overhead. 
>> 
>>   ILU is ONLY done if the matrix has changed (which seems wrong).
>>> 
>>> 2. Some KSP solvers like BCGS or TFQMR have two "half-iterations" per 
>>> iteration step. Not sure how it works in PETSc, but is it possible that 
>>> both "half" relative residuals are counted in the output array, 
>>> doubling the number of iterations (though that cannot explain the extra time 
>>> consumed)?
>> 
>>   Yes, PETSc might report them as two; you need to check the exact code.
>> 
>>> 
>>> Anyway, the output with -log_view from the same 278906 by 278906 matrix 
>>> with 3-block D-ILU in PETSc is as follows: 
>>> 
>>> 
>>> ---------------------------------------------- PETSc Performance Summary: 
>>> ----------------------------------------------
>>> 
>>> MEMsolv.lu on a arch-darwin-c-opt named Haos-MBP with 1 processor, by 
>>> donghao Thu Feb  6 09:07:35 2020
>>> Using Petsc Release Version 3.12.3, unknown
>>> 
>>>                          Max       Max/Min     Avg       Total
>>> Time (sec):           4.443e+00     1.000   4.443e+00
>>> Objects:              1.155e+03     1.000   1.155e+03
>>> Flop:                 4.311e+09     1.000   4.311e+09  4.311e+09
>>> Flop/sec:             9.703e+08     1.000   9.703e+08  9.703e+08
>>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>>> MPI Reductions:       0.000e+00     0.000
>>> 
>>> Flop counting convention: 1 flop = 1 real number operation of type 
>>> (multiply/divide/add/subtract)
>>>                             e.g., VecAXPY() for real vectors of length N 
>>> --> 2N flop
>>>                             and VecAXPY() for complex vectors of length N 
>>> --> 8N flop
>>> 
>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages --- 
>>>  -- Message Lengths --  -- Reductions --
>>>                         Avg     %Total     Avg     %Total    Count   %Total 
>>>     Avg         %Total    Count   %Total
>>>  0:      Main Stage: 4.4435e+00 100.0%  4.3113e+09 100.0%  0.000e+00   0.0% 
>>>  0.000e+00        0.0%  0.000e+00   0.0%
>>> 
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on 
>>> interpreting output.
>>> Phase summary info:
>>>    Count: number of times phase was executed
>>>    Time and Flop: Max - maximum over all processors
>>>                   Ratio - ratio of maximum to minimum over all processors
>>>    Mess: number of messages sent
>>>    AvgLen: average message length (bytes)
>>>    Reduct: number of global reductions
>>>    Global: entire computation
>>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and 
>>> PetscLogStagePop().
>>>       %T - percent time in this phase         %F - percent flop in this 
>>> phase
>>>       %M - percent messages in this phase     %L - percent message lengths 
>>> in this phase
>>>       %R - percent reductions in this phase
>>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over 
>>> all processors)
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flop                         
>>>      --- Global ---  --- Stage ----  Total
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
>>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>> 
>>> --- Event Stage 0: Main Stage
>>> 
>>> BuildTwoSidedF         1 1.0 2.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatMult               83 1.0 1.7815e+00 1.0 2.08e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 40 48  0  0  0  40 48  0  0  0  1168
>>> MatSolve             252 1.0 1.2708e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 29 28  0  0  0  29 28  0  0  0   939
>>> MatLUFactorNum         3 1.0 7.9725e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  2  1  0  0  0   2  1  0  0  0   298
>>> MatILUFactorSym        3 1.0 2.6998e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatAssemblyBegin       5 1.0 3.6000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatAssemblyEnd         5 1.0 3.1619e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatGetRowIJ            3 1.0 2.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatCreateSubMats       1 1.0 3.9659e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatGetOrdering         3 1.0 4.3070e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatView                3 1.0 1.3600e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecDot                82 1.0 1.8948e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  4  4  0  0  0   4  4  0  0  0   966
>>> VecDotNorm2           41 1.0 1.6812e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  4  4  0  0  0   4  4  0  0  0  1088
>>> VecNorm               43 1.0 9.5099e-02 1.0 9.59e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  2  2  0  0  0   2  2  0  0  0  1009
>>> VecCopy                2 1.0 1.0120e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecSet               271 1.0 3.8922e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> VecAXPY                1 1.0 7.7200e-04 1.0 2.23e+06 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  2890
>>> VecAXPBYCZ            82 1.0 2.4370e-01 1.0 3.66e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  5  8  0  0  0   5  8  0  0  0  1502
>>> VecWAXPY              82 1.0 1.4148e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  4  0  0  0   3  4  0  0  0  1293
>>> VecAssemblyBegin       2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecAssemblyEnd         2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecScatterBegin       84 1.0 5.9300e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> KSPSetUp               4 1.0 1.4167e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> KSPSolve               1 1.0 4.0250e+00 1.0 4.31e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 91100  0  0  0  91100  0  0  0  1071
>>> PCSetUp                4 1.0 1.5207e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  1  0  0  0   3  1  0  0  0   156
>>> PCSetUpOnBlocks        1 1.0 1.1116e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  1  0  0  0   3  1  0  0  0   214
>>> PCApply               84 1.0 1.2912e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 29 28  0  0  0  29 28  0  0  0   924
>>> PCApplyOnBlocks      252 1.0 1.2909e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 29 28  0  0  0  29 28  0  0  0   924
>>> ------------------------------------------------------------------------------------------------------------------------
>>> 
>>> # I skipped the memory part - the options (and compiler options) are as 
>>> follows: 
>>> 
>>> #PETSc Option Table entries: 
>>> -ksp_type bcgs
>>> -ksp_view
>>> -log_view
>>> -pc_bjacobi_local_blocks 3
>>> -pc_factor_levels 0
>>> -pc_sub_type ilu
>>> -pc_type bjacobi
>>> #End of PETSc Option Table entries
>>> Compiled with FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 
>>> sizeof(PetscScalar) 16 sizeof(PetscInt) 4
>>> Configure options: --with-scalar-type=complex --download-mumps 
>>> --download-scalapack --with-fortran-kernels=1 --FOPTFLAGS="-O3 
>>> -fastsse -mp -openmp" --COPTFLAGS="-O3 -fastsse -mp -openmp" 
>>> --CXXOPTFLAGS="-O3 -fastsse -mp -openmp" --with-debugging=0
>>> -----------------------------------------
>>> Libraries compiled on 2020-02-03 10:44:31 on Haos-MBP
>>> Machine characteristics: Darwin-19.2.0-x86_64-i386-64bit
>>> Using PETSc directory: /Users/donghao/src/git/PETSc-current
>>> Using PETSc arch: arch-darwin-c-opt
>>> -----------------------------------------
>>> 
>>> Using C compiler: mpicc  -Wall -Wwrite-strings -Wno-strict-aliasing 
>>> -Wno-unknown-pragmas -fstack-protector -fno-stack-check 
>>> -Qunused-arguments -fvisibility=hidden
>>> Using Fortran compiler: mpif90  -Wall -ffree-line-length-0 
>>> -Wno-unused-dummy-argument
>>> 
>>> Using include paths: -I/Users/donghao/src/git/PETSc-current/include 
>>> -I/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/include
>>> -----------------------------------------
>>> 
>>> Using C linker: mpicc
>>> Using Fortran linker: mpif90
>>> Using libraries: 
>>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>>> -L/Users/donghao/src/git/PETSc-   current/arch-darwin-c-opt/lib -lpetsc 
>>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>>> -L/Users/    donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>>> -Wl,-rpath,/usr/local/opt/libevent/lib -L/usr/local/opt/libevent/    lib 
>>> -Wl,-rpath,/usr/local/Cellar/open-mpi/4.0.2/lib 
>>> -L/usr/local/Cellar/open-mpi/4.0.2/lib -Wl,-rpath,/usr/local/Cellar/ 
>>> gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-darwin19/9.2.0 
>>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-        
>>> darwin19/9.2.0 -Wl,-rpath,/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 
>>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 -lcmumps -   ldmumps -lsmumps 
>>> -lzmumps -lmumps_common -lpord -lscalapack -llapack -lblas -lc++ -ldl 
>>> -lmpi_usempif08 -                 lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi 
>>> -lgfortran -lquadmath -lm -lc++ -ldl
>>> 
>>> 
>>> On the other hand, running PETSc with 
>>> 
>>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -pc_sub_type lu -ksp_type gmres 
>>> -ksp_monitor -ksp_view -log_view
>>> 
>>> For the same problem, this takes 5.37s and 72 GMRES iterations. Our previous 
>>> tests show that BiCGstab (bcgs in PETSc) is almost always the most 
>>> effective KSP solver for our non-symmetric complex system. Strangely, the 
>>> solver is still using ilu instead of lu for the sub-blocks. The output is like: 
>> 
>>  -sub_pc_type lu
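>> 
>>   (i.e. the full option set for that run would be something like 
>>   -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type lu -ksp_type gmres 
>>   -ksp_monitor -ksp_view)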
>> 
>>> 
>>>   0 KSP Residual norm 2.480412407430e+02
>>>   1 KSP Residual norm 8.848059967835e+01
>>>   2 KSP Residual norm 3.415272863261e+01
>>>   3 KSP Residual norm 1.563045190939e+01
>>>   4 KSP Residual norm 6.241296940043e+00
>>>   5 KSP Residual norm 2.739710899854e+00
>>>   6 KSP Residual norm 1.391304148888e+00
>>>   7 KSP Residual norm 7.959262020849e-01
>>>   8 KSP Residual norm 4.828323055231e-01
>>>   9 KSP Residual norm 2.918529739200e-01
>>>  10 KSP Residual norm 1.905508589557e-01
>>>  11 KSP Residual norm 1.291541892702e-01
>>>  12 KSP Residual norm 8.827145774707e-02
>>>  13 KSP Residual norm 6.521331095889e-02
>>>  14 KSP Residual norm 5.095787952595e-02
>>>  15 KSP Residual norm 4.043060387395e-02
>>>  16 KSP Residual norm 3.232590200012e-02
>>>  17 KSP Residual norm 2.593944982216e-02
>>>  18 KSP Residual norm 2.064639483533e-02
>>>  19 KSP Residual norm 1.653916663492e-02
>>>  20 KSP Residual norm 1.334946415452e-02
>>>  21 KSP Residual norm 1.092886880597e-02
>>>  22 KSP Residual norm 8.988004105542e-03
>>>  23 KSP Residual norm 7.466501315240e-03
>>>  24 KSP Residual norm 6.284389135436e-03
>>>  25 KSP Residual norm 5.425231669964e-03
>>>  26 KSP Residual norm 4.766338253084e-03
>>>  27 KSP Residual norm 4.241238878242e-03
>>>  28 KSP Residual norm 3.808113525685e-03
>>>  29 KSP Residual norm 3.449383788116e-03
>>>  30 KSP Residual norm 3.126025526388e-03
>>>  31 KSP Residual norm 2.958328054299e-03
>>>  32 KSP Residual norm 2.802344900403e-03
>>>  33 KSP Residual norm 2.621993580492e-03
>>>  34 KSP Residual norm 2.430066269304e-03
>>>  35 KSP Residual norm 2.259043079597e-03
>>>  36 KSP Residual norm 2.104287972986e-03
>>>  37 KSP Residual norm 1.952916080045e-03
>>>  38 KSP Residual norm 1.804988937999e-03
>>>  39 KSP Residual norm 1.643302117377e-03
>>>  40 KSP Residual norm 1.471661332036e-03
>>>  41 KSP Residual norm 1.286445911163e-03
>>>  42 KSP Residual norm 1.127543025848e-03
>>>  43 KSP Residual norm 9.777148275484e-04
>>>  44 KSP Residual norm 8.293314450006e-04
>>>  45 KSP Residual norm 6.989331136622e-04
>>>  46 KSP Residual norm 5.852307780220e-04
>>>  47 KSP Residual norm 4.926715539762e-04
>>>  48 KSP Residual norm 4.215941372075e-04
>>>  49 KSP Residual norm 3.699489548162e-04
>>>  50 KSP Residual norm 3.293897163533e-04
>>>  51 KSP Residual norm 2.959954542998e-04
>>>  52 KSP Residual norm 2.700193032414e-04
>>>  53 KSP Residual norm 2.461789791204e-04
>>>  54 KSP Residual norm 2.218839085563e-04
>>>  55 KSP Residual norm 1.945154309976e-04
>>>  56 KSP Residual norm 1.661128781744e-04
>>>  57 KSP Residual norm 1.413198766258e-04
>>>  58 KSP Residual norm 1.213984003195e-04
>>>  59 KSP Residual norm 1.044317450754e-04
>>>  60 KSP Residual norm 8.919957502977e-05
>>>  61 KSP Residual norm 8.042584301275e-05
>>>  62 KSP Residual norm 7.292784493581e-05
>>>  63 KSP Residual norm 6.481935501872e-05
>>>  64 KSP Residual norm 5.718564652679e-05
>>>  65 KSP Residual norm 5.072589750116e-05
>>>  66 KSP Residual norm 4.487930741285e-05
>>>  67 KSP Residual norm 3.941040674119e-05
>>>  68 KSP Residual norm 3.492873281291e-05
>>>  69 KSP Residual norm 3.103798339845e-05
>>>  70 KSP Residual norm 2.822943237409e-05
>>>  71 KSP Residual norm 2.610615023776e-05
>>>  72 KSP Residual norm 2.441692671173e-05
>>> KSP Object: 1 MPI processes
>>>   type: gmres
>>>     restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization 
>>> with no iterative refinement
>>>     happy breakdown tolerance 1e-30
>>>   maximum iterations=150, nonzero initial guess
>>>   tolerances:  relative=1e-07, absolute=1e-50, divergence=10000.
>>>   left preconditioning
>>>   using PRECONDITIONED norm type for convergence test
>>> PC Object: 1 MPI processes
>>>   type: bjacobi
>>>     number of blocks = 3
>>>     Local solve is same for all blocks, in the following KSP and PC objects:
>>>     KSP Object: (sub_) 1 MPI processes
>>>       type: preonly
>>>       maximum iterations=10000, initial guess is zero
>>>       tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
>>>       left preconditioning
>>>       using NONE norm type for convergence test
>>>     PC Object: (sub_) 1 MPI processes
>>>       type: ilu
>>>         out-of-place factorization
>>>         0 levels of fill
>>>         tolerance for zero pivot 2.22045e-14
>>>         matrix ordering: natural
>>>         factor fill ratio given 1., needed 1.
>>>           Factored matrix follows:
>>>             Mat Object: 1 MPI processes
>>>               type: seqaij
>>>               rows=92969, cols=92969
>>>               package used to perform factorization: petsc
>>>               total: nonzeros=638417, allocated nonzeros=638417
>>>               total number of mallocs used during MatSetValues calls=0
>>>                 not using I-node routines
>>>       linear system matrix = precond matrix:
>>>       Mat Object: 1 MPI processes
>>>         type: seqaij
>>>         rows=92969, cols=92969
>>>         total: nonzeros=638417, allocated nonzeros=638417
>>>         total number of mallocs used during MatSetValues calls=0
>>>           not using I-node routines
>>>   linear system matrix = precond matrix:
>>>   Mat Object: 1 MPI processes
>>>     type: mpiaij
>>>     rows=278906, cols=278906
>>>     total: nonzeros=3274027, allocated nonzeros=3274027
>>>     total number of mallocs used during MatSetValues calls=0
>>>       not using I-node (on process 0) routines
>>> ...
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flop                         
>>>      --- Global ---  --- Stage ----  Total
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
>>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>> 
>>> --- Event Stage 0: Main Stage
>>> 
>>> BuildTwoSidedF         1 1.0 2.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatMult               75 1.0 1.5812e+00 1.0 1.88e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 28 24  0  0  0  28 24  0  0  0  1189
>>> MatSolve             228 1.0 1.1442e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 20 14  0  0  0  20 14  0  0  0   944
>> 
>>   These flop rates are ok, but not great. Perhaps an older machine.
>> 
>>> MatLUFactorNum         3 1.0 8.1930e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0   290
>>> MatILUFactorSym        3 1.0 2.7102e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatAssemblyBegin       5 1.0 3.7000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatAssemblyEnd         5 1.0 3.1895e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatGetRowIJ            3 1.0 2.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatCreateSubMats       1 1.0 4.0904e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatGetOrdering         3 1.0 4.2640e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatView                3 1.0 1.4400e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecMDot               72 1.0 1.1984e+00 1.0 2.25e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 21 28  0  0  0  21 28  0  0  0  1877
>> 
>>    21 percent of the time in VecMDot is huge for a sequential run. I 
>> think maybe you are using a terrible OpenMP BLAS? 
>> 
>>    Send configure.log 
>> 
>> 
>>> VecNorm               76 1.0 1.6841e-01 1.0 1.70e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  2  0  0  0   3  2  0  0  0  1007
>>> VecScale              75 1.0 1.8241e-02 1.0 8.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  1  0  0  0   0  1  0  0  0  4587
>>> VecCopy                3 1.0 1.4970e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecSet               276 1.0 9.1970e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>>> VecAXPY                6 1.0 3.7450e-03 1.0 1.34e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  3575
>>> VecMAXPY              75 1.0 1.0022e+00 1.0 2.41e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 18 30  0  0  0  18 30  0  0  0  2405
>>> VecAssemblyBegin       2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecAssemblyEnd         2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecScatterBegin       76 1.0 5.5100e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecNormalize          75 1.0 1.8462e-01 1.0 2.51e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  3  0  0  0   3  3  0  0  0  1360
>>> KSPSetUp               4 1.0 1.1341e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> KSPSolve               1 1.0 5.3123e+00 1.0 7.91e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 93100  0  0  0  93100  0  0  0  1489
>>> KSPGMRESOrthog        72 1.0 2.1316e+00 1.0 4.50e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 37 57  0  0  0  37 57  0  0  0  2110
>>> PCSetUp                4 1.0 1.5531e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  0  0  0  0   3  0  0  0  0   153
>>> PCSetUpOnBlocks        1 1.0 1.1343e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  2  0  0  0  0   2  0  0  0  0   209
>>> PCApply               76 1.0 1.1671e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 20 14  0  0  0  20 14  0  0  0   925
>>> PCApplyOnBlocks      228 1.0 1.1668e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 20 14  0  0  0  20 14  0  0  0   925
>>> ------------------------------------------------------------------------------------------------------------------------
>>> ...
>>> #PETSc Option Table entries:
>>> -ksp_monitor
>>> -ksp_type gmres
>>> -ksp_view
>>> -log_view
>>> -pc_bjacobi_local_blocks 3
>>> -pc_sub_type lu
>>> -pc_type bjacobi
>>> #End of PETSc Option Table entries
>>> ...
>>> 
>>> Does any of the setup/output ring a bell? 
>>> 
>>> BTW, out of curiosity - what is an "I-node" routine?
>>> 
>>> 
>>> Cheers, 
>>> Hao
>>> 
>>> 
>>> From: Smith, Barry F. <[email protected]>
>>> Sent: Wednesday, February 5, 2020 9:42 PM
>>> To: Hao DONG <[email protected]>
>>> Cc: [email protected] <[email protected]>
>>> Subject: Re: [petsc-users] What is the right way to implement a (block) 
>>> Diagonal ILU as PC?
>>> 
>>> 
>>> 
>>>> On Feb 5, 2020, at 4:36 AM, Hao DONG <[email protected]> wrote:
>>>> 
>>>> Thanks a lot for your suggestions, Hong and Barry - 
>>>> 
>>>> As you suggested, I tried out the LU direct solvers (built-in and MUMPS) 
>>>> first thing this morning, and they work perfectly, albeit slowly. As my 
>>>> problem is part of a PDE-based optimization, an exact solution during the 
>>>> search procedure is not necessary (I often set a relative tolerance of 
>>>> 1E-7/1E-8, etc. for Krylov subspace methods). I also tried BJACOBI with 
>>>> exact LU; the KSP converges in one or two iterations, as expected. 
>>>> 
>>>> I added the -ksp_view option for the D-ILU test (still with block Jacobi as 
>>>> the preconditioner and bcgs as the KSP solver). The KSPView output from one 
>>>> of the examples in a toy model looks like: 
>>>> 
>>>> KSP Object: 1 MPI processes
>>>>   type: bcgs
>>>>   maximum iterations=120, nonzero initial guess
>>>>   tolerances:  relative=1e-07, absolute=1e-50, divergence=10000.
>>>>   left preconditioning
>>>>   using PRECONDITIONED norm type for convergence test
>>>> PC Object: 1 MPI processes
>>>>   type: bjacobi
>>>>     number of blocks = 3
>>>>     Local solve is same for all blocks, in the following KSP and PC 
>>>> objects:
>>>>     KSP Object: (sub_) 1 MPI processes
>>>>       type: preonly
>>>>       maximum iterations=10000, initial guess is zero
>>>>       tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
>>>>       left preconditioning
>>>>       using NONE norm type for convergence test
>>>>     PC Object: (sub_) 1 MPI processes
>>>>       type: ilu
>>>>         out-of-place factorization
>>>>         0 levels of fill
>>>>         tolerance for zero pivot 2.22045e-14
>>>>         matrix ordering: natural
>>>>         factor fill ratio given 1., needed 1.
>>>>           Factored matrix follows:
>>>>             Mat Object: 1 MPI processes
>>>>               type: seqaij
>>>>               rows=11294, cols=11294
>>>>               package used to perform factorization: petsc
>>>>               total: nonzeros=76008, allocated nonzeros=76008
>>>>               total number of mallocs used during MatSetValues calls=0
>>>>                 not using I-node routines
>>>>       linear system matrix = precond matrix:
>>>>       Mat Object: 1 MPI processes
>>>>         type: seqaij
>>>>         rows=11294, cols=11294
>>>>         total: nonzeros=76008, allocated nonzeros=76008
>>>>         total number of mallocs used during MatSetValues calls=0
>>>>           not using I-node routines
>>>>   linear system matrix = precond matrix:
>>>>   Mat Object: 1 MPI processes
>>>>     type: mpiaij
>>>>     rows=33880, cols=33880
>>>>     total: nonzeros=436968, allocated nonzeros=436968
>>>>     total number of mallocs used during MatSetValues calls=0
>>>>       not using I-node (on process 0) routines
>>>> 
>>>> Do you see anything wrong with my setup?
>>>> 
>>>> I also tried a quick performance test with a small 278906 by 278906 matrix 
>>>> (3850990 nnz) with the following parameters: 
>>>> 
>>>> -ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 -pc_sub_type 
>>>> ilu -ksp_view
>>>> 
>>>> Reducing the relative residual to 1E-7 
>>>> 
>>>> Took 4.08s with 41 bcgs iterations. 
>>>> 
>>>> Merely changing the -pc_bjacobi_local_blocks to 6 
>>>> 
>>>> Took 7.02s with 73 bcgs iterations. 9 blocks would further take 9.45s with 
>>>> 101 bcgs iterations.
>>> 
>>>    This is normal: more blocks, slower convergence.
>>>> 
>>>> As a reference, my home-brew Fortran code solves the same problem (3-block 
>>>> D-ILU0) in 
>>>> 
>>>> 1.84s with 24 bcgs iterations (the bcgs code is also a home-brew one)…
>>>> 
>>>    Run the PETSc code with optimization (./configure --with-debugging=0) and 
>>> run the code with -log_view; this will show where the PETSc code is spending 
>>> the time (send it to us).
>>> 
>>> 
>>>> 
>>>> 
>>>> Well, by "use an explicit L/U matrix as preconditioner", I wonder whether a 
>>>> user is allowed to provide his own (separate) L and U Mat for 
>>>> preconditioning (like how it works in the Matlab solvers), e.g.
>>>> 
>>>> x = qmr(A,b,Tol,MaxIter,L,U,x)
>>>> 
>>>> As I have already explicitly constructed the L and U matrices in Fortran, it 
>>>> is not hard to convert them to Mat in PETSc to test PETSc and my Fortran 
>>>> code head-to-head. In that case the A, b, x, and L/U would all be identical, 
>>>> so it would be easier to see where the problem comes from. 
>>>> 
>>>> 
>>>     No, we don't provide this kind of support
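>>> 
>>>    (If you really do want to wire your own factors in for a head-to-head 
>>> test, the generic mechanism is PCSHELL. An untested sketch, with the apply 
>>> routine left as a placeholder for your own D-ILU triangular solves: 
>>> 
>>>       external MyDILUApply
>>>       call KSPGetPC(ksp_local,pc_local,ierr)
>>>       call PCSetType(pc_local,PCSHELL,ierr)
>>>       call PCShellSetApply(pc_local,MyDILUApply,ierr)
>>> 
>>>       ! the callback PETSc will invoke to apply the preconditioner:
>>>       subroutine MyDILUApply(pc,x,y,ierr)
>>>       PC             pc
>>>       Vec            x,y
>>>       PetscErrorCode ierr
>>>       ! compute y ~= A^{-1} x here using your own L/U factors
>>>       end subroutine
>>> 
>>>    This is only worth doing for plugging in externally built factors; for 
>>> factors built by PETSc itself the bjacobi/ilu route already covers it.)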
>>> 
>>> 
>>>> 
>>>> BTW, there is another thing I wondered about - is there a way to output the 
>>>> residual in the unpreconditioned norm? I tried
>>>> 
>>>> call KSPSetNormType(ksp_local, KSP_NORM_UNPRECONDITIONED, ierr)
>>>> 
>>>> but I always get an error that the current KSP does not support the 
>>>> unpreconditioned norm with LEFT/RIGHT preconditioning (either way). Is it 
>>>> possible to output the unpreconditioned residual in PETSc at all?
>>> 
>>>   -ksp_monitor_true_residual.   You can also run GMRES (and some other 
>>> methods) with right preconditioning, -ksp_pc_side right; then the residual 
>>> computed by the algorithm is the unpreconditioned residual.
>>> 
>>>   KSPSetNormType sets the norm used in the algorithm; it generally always 
>>> has to be left or right, and only a couple of algorithms support the 
>>> unpreconditioned norm directly.
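>>> 
>>>   (For example - standard options and calls, but untested here: 
>>> 
>>>   -ksp_type gmres -ksp_pc_side right -ksp_monitor_true_residual 
>>> 
>>>   or, in the code, 
>>> 
>>>       call KSPSetPCSide(ksp_local,PC_RIGHT,ierr) 
>>> 
>>>   -ksp_monitor_true_residual prints both the preconditioned and the true 
>>> (unpreconditioned) residual norms at each iteration.)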
>>> 
>>>   Barry
>>> 
>>> 
>>>> 
>>>> Cheers, 
>>>> Hao
>>>> 
>>>> 
>>>> From: Smith, Barry F. <[email protected]>
>>>> Sent: Tuesday, February 4, 2020 8:27 PM
>>>> To: Hao DONG <[email protected]>
>>>> Cc: [email protected] <[email protected]>
>>>> Subject: Re: [petsc-users] What is the right way to implement a (block) 
>>>> Diagonal ILU as PC?
>>>> 
>>>> 
>>>> 
>>>>> On Feb 4, 2020, at 12:41 PM, Hao DONG <[email protected]> wrote:
>>>>> 
>>>>> Dear all, 
>>>>> 
>>>>> 
>>>>> I have a few questions about the implementation of a diagonal ILU PC in 
>>>>> PETSc. I want to solve a very simple system with KSP (in parallel); the 
>>>>> nature of the system (finite-difference time-harmonic Maxwell) is 
>>>>> probably not important to the question itself. Long story short, I just 
>>>>> need to solve a set of equations Ax = b with a block-diagonal system 
>>>>> matrix, like (not sure if the mono font works): 
>>>>> 
>>>>>   |X    |  
>>>>> A =|  Y  |  
>>>>>   |    Z| 
>>>>> 
>>>>> Note that A is not really block-diagonal; it's just a multi-diagonal 
>>>>> matrix determined by the FD mesh, where most elements are close to the 
>>>>> diagonal. So instead of a full ILU decomposition, a D-ILU is a good 
>>>>> approximation as a preconditioner, and the number of blocks should not 
>>>>> matter too much: 
>>>>> 
>>>>>    |Lx      |         |Ux      |
>>>>> L = |   Ly   | and U = |   Uy   |
>>>>>    |      Lz|         |      Uz|
>>>>> 
>>>>> Where [Lx, Ux] = ILU0(X), etc. Indeed, the D-ILU preconditioner (with 3N 
>>>>> blocks) is quite efficient with Krylov subspace methods like BiCGstab or 
>>>>> QMR in my sequential Fortran/Matlab code. 
>>>>> 
>>>>> So like most users, I am looking for a parallel implementation of this 
>>>>> in PETSc. After looking through the manual, it seems that the 
>>>>> most straightforward way to do it is through PCBJACOBI. Not sure I 
>>>>> understand it right: I just set up a 3-block PCBJACOBI and gave each 
>>>>> block a KSP with PCILU. Is this supposed to be equivalent to my D-ILU 
>>>>> preconditioner? My little block of Fortran code looks like: 
>>>>> ...
>>>>>      call PCBJacobiSetTotalBlocks(pc_local,Ntotal,                   &
>>>>>     &     isubs,ierr)
>>>>>      call PCBJacobiSetLocalBlocks(pc_local, Nsub,                    &
>>>>>     &    isubs(istart:iend),ierr)
>>>>>      ! set up the block jacobi structure
>>>>>      call KSPSetup(ksp_local,ierr)
>>>>>      ! allocate sub ksps
>>>>>      allocate(ksp_sub(Nsub))
>>>>>      call PCBJacobiGetSubKSP(pc_local,Nsub,istart,                   &
>>>>>     &     ksp_sub,ierr)
>>>>>      do i=1,Nsub
>>>>>          call KSPGetPC(ksp_sub(i),pc_sub,ierr)
>>>>>          !ILU preconditioner
>>>>>          call PCSetType(pc_sub,ptype,ierr)
>>>>>          call PCFactorSetLevels(pc_sub,1,ierr) ! use ILU(1) here
>>>>>          call KSPSetType(ksp_sub(i),KSPPREONLY,ierr)
>>>>>      end do
>>>>>      call KSPSetTolerances(ksp_local,KSSiter%tol,PETSC_DEFAULT_REAL, &
>>>>>     &     PETSC_DEFAULT_REAL,KSSiter%maxit,ierr)
>>>>> … 
>>>> 
>>>>     This code looks essentially right. You should run with -ksp_view to 
>>>> see exactly what PETSc is using for the solver. 
>>>> 
>>>>> 
>>>>> I understand that the parallel performance may not be comparable, so I 
>>>>> first set up a one-process test (with MPIAIJ, but all the rows are local 
>>>>> since there is only one process). The system is solved without any 
>>>>> problem (identical results within error). But the performance is actually 
>>>>> a lot worse (the code was built without debugging flags for the 
>>>>> performance tests) than my own home-brew implementation in Fortran (I 
>>>>> wrote my own ILU0 in CSR sparse matrix format), which is hard to believe. 
>>>>> I suspect the difference comes from the PC, as the PETSc version took many 
>>>>> more BiCGstab iterations (60-ish vs 100-ish) to converge to the same 
>>>>> relative tolerance. 
>>>> 
>>>>   PETSc uses GMRES by default with a restart of 30 and left 
>>>> preconditioning. Different exact choices in the solver (which 
>>>> is why -ksp_view is so useful) can explain the differences in the runs 
>>>> between your code and PETSc's.
>>>>> 
>>>>> This is further confirmed when I change the setup of the D-ILU (using 6 or 
>>>>> 9 blocks instead of 3). While my Fortran/Matlab codes see minimal 
>>>>> performance difference (<5%) when I play with the D-ILU setup, increasing 
>>>>> the number of D-ILU blocks from 3 to 9 caused the KSP setup with 
>>>>> PCBJACOBI to suffer a performance decrease of more than 50% in the 
>>>>> sequential test.
>>>> 
>>>>   This is odd: the more blocks, the smaller each block, so the quicker the 
>>>> ILU setup should be. You can run various cases with -log_view and send us 
>>>> the output to see how much time is spent in each part of the computation.
>>>> 
>>>>> So my implementation IS somewhat different in PETSc. Am I missing something 
>>>>> in the PCBJACOBI setup? Or do I have some fundamental misunderstanding of 
>>>>> how PCBJACOBI works in PETSc? 
>>>> 
>>>>   Probably not.
>>>>> 
>>>>> If this is not the right way to implement a block-diagonal ILU as a 
>>>>> (parallel) PC, please kindly point me in the right direction. I searched 
>>>>> through the mailing list for answers, only to find a couple of 
>>>>> similar questions... An example would be nice.
>>>> 
>>>>   Your approach seems fundamentally right, but I cannot rule out possible 
>>>> bugs.
>>>>> 
>>>>> On the other hand, does PETSc support a simple way to use an explicit L/U 
>>>>> matrix as a preconditioner? I could import the D-ILU matrix (I have already 
>>>>> converted my A matrix into a Mat) constructed in my Fortran code to make a 
>>>>> better comparison. Or do I have to construct my own PC using PCSHELL? If 
>>>>> so, is there a good tutorial/example on how to use PCSHELL 
>>>>> (in a more advanced sense, like how to set up the PC side and transpose)? 
>>>> 
>>>>   Not sure what you mean by an explicit L/U matrix as a preconditioner. As 
>>>> Hong said, yes, you can use a parallel LU from MUMPS or SuperLU_DIST or 
>>>> PaStiX as the solver. You do not need any shell code. You simply need to 
>>>> set the PCType to lu. 
>>>> 
>>>>   You can also set all of these options from the command line and don't 
>>>> need to hardwire them in the code. So call KSPSetFromOptions() and then, 
>>>> for example:
>>>> 
>>>>    -pc_type bjacobi  -pc_bjacobi_local_blocks 3 -pc_sub_type ilu (this 
>>>> last one is applied to each block so you could use -pc_type lu and it 
>>>> would use lu on each block.) 
>>>> 
>>>>   -ksp_type preonly  -pc_type lu -pc_factor_mat_solver_type mumps  (do 
>>>> parallel LU with MUMPS)
>>>> 
>>>> By not hardwiring these in the code and just using options, you can test 
>>>> out different cases much more quickly.
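>>>> 
>>>> For instance (a sketch only, assuming ksp_local is your KSP object and b, x 
>>>> are your Vecs): 
>>>> 
>>>>       call KSPSetFromOptions(ksp_local,ierr)
>>>>       call KSPSolve(ksp_local,b,x,ierr)
>>>> 
>>>> and the KSP type, PC type, number of blocks, sub-solvers, etc. are then all 
>>>> picked up from the command-line options at run time.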
>>>> 
>>>> Use -ksp_view to make sure it is using the solver the way you expect.
>>>> 
>>>> Barry
>>>> 
>>>> 
>>>> 
>>>>   Barry
>>>> 
>>>>> 
>>>>> Thanks in advance, 
>>>>> 
>>>>> Hao
>> 
>> <configure.log><configure.log>
> 
