You should google for preconditioners/solvers for the 3D time-harmonic Maxwell 
equation arising from staggered-grid finite differences. Maxwell has its own 
particular structure and its own difficulties with iterative solvers, depending 
on the parameters.  Also see page 87 in 
https://www.mcs.anl.gov/petsc/petsc-current/docs/manual.pdf


    Barry

  -pc_type asm may work well for a handful of processors. 
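
  As a rough starting point (just a sketch of options to experiment with, not a 
recipe - whether it helps for your Maxwell system is problem-dependent):

    mpiexec -n 8 ./Mod3DMT.test -ksp_type bcgs -pc_type asm -pc_asm_overlap 1 
        -sub_pc_type ilu -ksp_monitor_true_residual -log_view

  This lets you compare ASM against your block-Jacobi runs without changing any 
code.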

> On Feb 10, 2020, at 8:47 AM, Hao DONG <[email protected]> wrote:
> 
> Hi Barry, 
> 
> Thank you for your suggestions (and insights)! Indeed, my initial motivation to 
> try out PETSc is the variety of methods it provides. As my matrix pattern is 
> relatively simple (3D time-harmonic Maxwell equation arising from a 
> staggered-grid finite difference), and considering that I cannot afford the 
> direct solvers, I was looking for a fast Krylov subspace method / 
> preconditioner that scales well with, say, tens of CPU cores. 
> 
> As a simple block-Jacobi preconditioner seems to lose its efficiency with 
> more than a handful of blocks, I plan to look into other 
> methods/preconditioners, e.g. multigrid (as a preconditioner) or domain 
> decomposition methods. But I will probably need to go through a fair amount of 
> literature before laying my hands on those (or bothering you with more 
> questions!). Anyway, thanks again for your kind help. 
> 
> 
> All the best, 
> Hao
> 
>> On Feb 8, 2020, at 8:02 AM, Smith, Barry F. <[email protected]> wrote:
>> 
>> 
>> 
>>> On Feb 7, 2020, at 7:44 AM, Hao DONG <[email protected]> wrote:
>>> 
>>> Thanks, Barry, I really appreciate your help - 
>>> 
>>> I removed the OpenMP flags and rebuilt PETSc. So the problem is from the 
>>> BLAS lib I linked?
>> 
>> Yes, the OpenMP flag causes the BLAS to run in parallel, but the problem is not 
>> big enough and the machine is not good enough for parallel BLAS to speed things 
>> up; instead it slows things down a lot. We see this often: parallel BLAS 
>> must be used with care.
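>> 
>> One quick way to check (assuming your BLAS honors the usual environment 
>> variables) is to pin it to a single thread, e.g. OMP_NUM_THREADS=1 or 
>> OPENBLAS_NUM_THREADS=1, or to configure PETSc against a plain reference BLAS 
>> with --download-fblaslapack, and compare the timings.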
>> 
>>> Not sure which version my BLAS is, though… But I also included the 
>>> --download-scalapack option. Shouldn't that enable linking with PBLAS 
>>> automatically?
>>> 
>>> After looking at the bcgs code in PETSc, I suppose the iteration residual is 
>>> indeed recorded twice per "actual iteration", which explains the difference 
>>> in iteration counts. 
>>> 
>>> My laptop is indeed an old machine (MBP15 mid-2014). I just cannot work 
>>> with vi without a physical ESC key...
>> 
>>  The latest has a physical ESC; I am stuck without the ESC for a couple more 
>> years.
>> 
>>> I have attached the configure.log - didn't know that it is so large! 
>>> 
>>> Anyway, it seems that removing -openmp changes quite a lot of things, and the 
>>> performance is indeed much better - the flop/sec increases by a 
>>> factor of 3. Still, I am getting 20 percent of the time in VecMDot with GMRES, 
>>> but no VecMDot in BCGS at all (see output below) - is that a feature of the 
>>> GMRES method? 
>> 
>>  Yes, GMRES orthogonalizes against up to the last restart directions, which 
>> uses these routines, while BCGS does not; this is why BCGS is cheaper per 
>> iteration.
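>> 
>> (The orthogonalization work also grows with the GMRES restart length, so 
>> -ksp_gmres_restart is one knob to experiment with if you stay with GMRES; 
>> -log_view will show how VecMDot and KSPGMRESOrthog respond.)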
>> 
>>   PETSc is no faster than your code because the algorithm is the same, the 
>> compilers are the same, and the hardware is the same. There is no way for PETSc 
>> to have clever tricks that make it much faster. What PETSc provides is a huge 
>> variety of tested algorithms that no single person could code on their own. 
>> Anything in PETSc you could code yourself, given endless time, and get 
>> basically the same performance.
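>> 
>>   For example, all of these are just different command-line options on the 
>> same executable (whether each converges well for your Maxwell system is, of 
>> course, problem-dependent):
>> 
>>   -ksp_type bcgs  -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu
>>   -ksp_type gmres -pc_type asm -pc_asm_overlap 1 -sub_pc_type ilu
>>   -ksp_type bcgs  -pc_type gamg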
>> 
>> Barry
>> 
>> 
>>> 
>>> here is the output of the same problem with:  
>>> 
>>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu -ksp_type 
>>> gmres -ksp_monitor -ksp_view
>>> 
>>> 
>>> ---------------------------------------------- PETSc Performance Summary: 
>>> ----------------------------------------------
>>> 
>>> Mod3DMT.test on a arch-darwin-c-opt named Haos-MBP with 1 processor, by 
>>> donghao Fri Feb  7 10:26:19 2020
>>> Using Petsc Release Version 3.12.3, unknown
>>> 
>>>                         Max       Max/Min     Avg       Total
>>> Time (sec):           2.520e+00     1.000   2.520e+00
>>> Objects:              1.756e+03     1.000   1.756e+03
>>> Flop:                 7.910e+09     1.000   7.910e+09  7.910e+09
>>> Flop/sec:             3.138e+09     1.000   3.138e+09  3.138e+09
>>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>>> MPI Reductions:       0.000e+00     0.000
>>> 
>>> Flop counting convention: 1 flop = 1 real number operation of type 
>>> (multiply/divide/add/subtract)
>>>                            e.g., VecAXPY() for real vectors of length N --> 
>>> 2N flop
>>>                            and VecAXPY() for complex vectors of length N 
>>> --> 8N flop
>>> 
>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages --- 
>>>  -- Message Lengths --  -- Reductions --
>>>                        Avg     %Total     Avg     %Total    Count   %Total  
>>>    Avg         %Total    Count   %Total
>>> 0:      Main Stage: 2.5204e+00 100.0%  7.9096e+09 100.0%  0.000e+00   0.0%  
>>> 0.000e+00        0.0%  0.000e+00   0.0%
>>> 
>>> ------------------------------------------------------------------------------------------------------------------------
>>> …
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flop                         
>>>      --- Global ---  --- Stage ----  Total
>>>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
>>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>> 
>>> --- Event Stage 0: Main Stage
>>> 
>>> BuildTwoSidedF         1 1.0 3.4000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatMult               75 1.0 6.2884e-01 1.0 1.88e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 25 24  0  0  0  25 24  0  0  0  2991
>>> MatSolve             228 1.0 4.4164e-01 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 18 14  0  0  0  18 14  0  0  0  2445
>>> MatLUFactorNum         3 1.0 4.1317e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  2  0  0  0  0   2  0  0  0  0   574
>>> MatILUFactorSym        3 1.0 2.3858e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatAssemblyBegin       5 1.0 4.4000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatAssemblyEnd         5 1.0 1.5067e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatGetRowIJ            3 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatCreateSubMats       1 1.0 2.4558e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatGetOrdering         3 1.0 1.3290e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatView                3 1.0 1.2800e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecMDot               72 1.0 4.9875e-01 1.0 2.25e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 20 28  0  0  0  20 28  0  0  0  4509
>>> VecNorm               76 1.0 6.6666e-02 1.0 1.70e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  2  0  0  0   3  2  0  0  0  2544
>>> VecScale              75 1.0 1.7982e-02 1.0 8.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  1  0  0  0   1  1  0  0  0  4653
>>> VecCopy                3 1.0 1.5080e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecSet               276 1.0 9.6784e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  4  0  0  0  0   4  0  0  0  0     0
>>> VecAXPY                6 1.0 3.6860e-03 1.0 1.34e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  3632
>>> VecMAXPY              75 1.0 4.0490e-01 1.0 2.41e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 16 30  0  0  0  16 30  0  0  0  5951
>>> VecAssemblyBegin       2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecAssemblyEnd         2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecScatterBegin       76 1.0 5.3800e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecNormalize          75 1.0 8.3690e-02 1.0 2.51e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  3  0  0  0   3  3  0  0  0  2999
>>> KSPSetUp               4 1.0 1.1663e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> KSPSolve               1 1.0 2.2119e+00 1.0 7.91e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 88100  0  0  0  88100  0  0  0  3576
>>> KSPGMRESOrthog        72 1.0 8.7843e-01 1.0 4.50e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 35 57  0  0  0  35 57  0  0  0  5121
>>> PCSetUp                4 1.0 9.2448e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  4  0  0  0  0   4  0  0  0  0   257
>>> PCSetUpOnBlocks        1 1.0 6.6597e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  0  0  0  0   3  0  0  0  0   356
>>> PCApply               76 1.0 4.6281e-01 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 18 14  0  0  0  18 14  0  0  0  2333
>>> PCApplyOnBlocks      228 1.0 4.6262e-01 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 18 14  0  0  0  18 14  0  0  0  2334
>>> ------------------------------------------------------------------------------------------------------------------------
>>> 
>>> Average time to get PetscTime(): 1e-07
>>> #PETSc Option Table entries:
>>> -I LBFGS
>>> -ksp_type gmres
>>> -ksp_view
>>> -log_view
>>> -pc_bjacobi_local_blocks 3
>>> -pc_type bjacobi
>>> -sub_pc_type ilu
>>> #End of PETSc Option Table entries
>>> Compiled with FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 
>>> sizeof(PetscScalar) 16 sizeof(PetscInt) 4
>>> Configure options: --with-scalar-type=complex --download-mumps 
>>> --download-scalapack --with-fortran-kernels=1 --          FOPTFLAGS="-O3 
>>> -ffree-line-length-0 -msse2" --COPTFLAGS="-O3 -msse2" --CXXOPTFLAGS="-O3 
>>> -msse2" --with-debugging=0
>>> -----------------------------------------
>>> Libraries compiled on 2020-02-07 10:07:42 on Haos-MBP
>>> Machine characteristics: Darwin-19.3.0-x86_64-i386-64bit
>>> Using PETSc directory: /Users/donghao/src/git/PETSc-current
>>> Using PETSc arch: arch-darwin-c-opt
>>> -----------------------------------------
>>> 
>>> Using C compiler: mpicc  -Wall -Wwrite-strings -Wno-strict-aliasing 
>>> -Wno-unknown-pragmas -fstack-protector -fno-stack-   check 
>>> -Qunused-arguments -fvisibility=hidden -O3 -msse2
>>> Using Fortran compiler: mpif90  -Wall -ffree-line-length-0 
>>> -Wno-unused-dummy-argument -O3 -ffree-line-length-0 -         msse2
>>> -----------------------------------------
>>> 
>>> Using include paths: -I/Users/donghao/src/git/PETSc-current/include 
>>> -I/Users/donghao/src/git/PETSc-current/arch-darwin-  c-opt/include
>>> -----------------------------------------
>>> 
>>> Using C linker: mpicc
>>> Using Fortran linker: mpif90
>>> Using libraries: 
>>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>>> -L/Users/donghao/src/git/PETSc-   current/arch-darwin-c-opt/lib -lpetsc 
>>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>>> -L/Users/    donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>>> -Wl,-rpath,/usr/local/opt/libevent/lib -L/usr/local/opt/libevent/    lib 
>>> -Wl,-rpath,/usr/local/Cellar/open-mpi/4.0.2/lib 
>>> -L/usr/local/Cellar/open-mpi/4.0.2/lib -Wl,-rpath,/usr/local/Cellar/ 
>>> gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-darwin19/9.2.0 
>>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-        
>>> darwin19/9.2.0 -Wl,-rpath,/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 
>>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 -lcmumps -   ldmumps -lsmumps 
>>> -lzmumps -lmumps_common -lpord -lscalapack -llapack -lblas -lc++ -ldl 
>>> -lmpi_usempif08 -                 lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi 
>>> -lgfortran -lquadmath -lm -lc++ -ldl
>>> -----------------------------------------
>>> 
>>> 
>>> 
>>> The BCGS solver performance is now comparable to my own Fortran code 
>>> (1.84s). Still, I feel that there is something wrong hidden somewhere in my 
>>> setup - a professional library should perform better, I believe. Are there any 
>>> other ideas I can look into? Interestingly, there is no VecMDot operation at 
>>> all! Here is the output with the options: 
>>> 
>>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu -ksp_type bcgs 
>>> -ksp_monitor -ksp_view
>>> 
>>> 
>>> 
>>> 
>>> ---------------------------------------------- PETSc Performance Summary: 
>>> ----------------------------------------------
>>> 
>>> Mod3DMT.test on a arch-darwin-c-opt named Haos-MBP with 1 processor, by 
>>> donghao Fri Feb  7 10:38:00 2020
>>> Using Petsc Release Version 3.12.3, unknown
>>> 
>>>                         Max       Max/Min     Avg       Total
>>> Time (sec):           2.187e+00     1.000   2.187e+00
>>> Objects:              1.155e+03     1.000   1.155e+03
>>> Flop:                 4.311e+09     1.000   4.311e+09  4.311e+09
>>> Flop/sec:             1.971e+09     1.000   1.971e+09  1.971e+09
>>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>>> MPI Reductions:       0.000e+00     0.000
>>> 
>>> Flop counting convention: 1 flop = 1 real number operation of type 
>>> (multiply/divide/add/subtract)
>>>                            e.g., VecAXPY() for real vectors of length N --> 
>>> 2N flop
>>>                            and VecAXPY() for complex vectors of length N 
>>> --> 8N flop
>>> 
>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages --- 
>>>  -- Message Lengths --  -- Reductions --
>>>                        Avg     %Total     Avg     %Total    Count   %Total  
>>>    Avg         %Total    Count   %Total
>>> 0:      Main Stage: 2.1870e+00 100.0%  4.3113e+09 100.0%  0.000e+00   0.0%  
>>> 0.000e+00        0.0%  0.000e+00   0.0%
>>> 
>>> ------------------------------------------------------------------------------------------------------------------------
>>> 
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flop                         
>>>      --- Global ---  --- Stage ----  Total
>>>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
>>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>> 
>>> --- Event Stage 0: Main Stage
>>> 
>>> BuildTwoSidedF         1 1.0 2.2000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatMult               83 1.0 7.8726e-01 1.0 2.08e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 36 48  0  0  0  36 48  0  0  0  2644
>>> MatSolve             252 1.0 5.5656e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 25 28  0  0  0  25 28  0  0  0  2144
>>> MatLUFactorNum         3 1.0 4.5115e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  2  1  0  0  0   2  1  0  0  0   526
>>> MatILUFactorSym        3 1.0 2.5103e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatAssemblyBegin       5 1.0 3.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatAssemblyEnd         5 1.0 1.5709e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatGetRowIJ            3 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatCreateSubMats       1 1.0 2.8989e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> MatGetOrdering         3 1.0 1.1200e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> MatView                3 1.0 1.2600e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecDot                82 1.0 8.9328e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  4  4  0  0  0   4  4  0  0  0  2048
>>> VecDotNorm2           41 1.0 9.9019e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  5  4  0  0  0   5  4  0  0  0  1848
>>> VecNorm               43 1.0 3.9988e-02 1.0 9.59e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  2  2  0  0  0   2  2  0  0  0  2399
>>> VecCopy                2 1.0 1.1150e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecSet               271 1.0 4.2833e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>>> VecAXPY                1 1.0 5.9200e-04 1.0 2.23e+06 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  3769
>>> VecAXPBYCZ            82 1.0 1.1448e-01 1.0 3.66e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  5  8  0  0  0   5  8  0  0  0  3196
>>> VecWAXPY              82 1.0 6.7460e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  4  0  0  0   3  4  0  0  0  2712
>>> VecAssemblyBegin       2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecAssemblyEnd         2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> VecScatterBegin       84 1.0 5.2800e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>> KSPSetUp               4 1.0 1.4765e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>> KSPSolve               1 1.0 1.8514e+00 1.0 4.31e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 85100  0  0  0  85100  0  0  0  2329
>>> PCSetUp                4 1.0 1.0193e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  5  1  0  0  0   5  1  0  0  0   233
>>> PCSetUpOnBlocks        1 1.0 7.1421e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00  3  1  0  0  0   3  1  0  0  0   332
>>> PCApply               84 1.0 5.7927e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 26 28  0  0  0  26 28  0  0  0  2060
>>> PCApplyOnBlocks      252 1.0 5.7902e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>>> 0.0e+00 26 28  0  0  0  26 28  0  0  0  2061
>>> ------------------------------------------------------------------------------------------------------------------------
>>> 
>>> 
>>> Cheers, 
>>> Hao
>>> 
>>> 
>>> 
>>> From: Smith, Barry F. <[email protected]>
>>> Sent: Thursday, February 6, 2020 7:03 PM
>>> To: Hao DONG <[email protected]>
>>> Cc: [email protected] <[email protected]>
>>> Subject: Re: [petsc-users] What is the right way to implement a (block) 
>>> Diagonal ILU as PC?
>>> 
>>> 
>>> Read my comments ALL the way down, they go a long way.
>>> 
>>>> On Feb 6, 2020, at 3:43 AM, Hao DONG <[email protected]> wrote:
>>>> 
>>>> Dear Hong and Barry, 
>>>> 
>>>> Thanks for the suggestions. So could there be some problems in my PETSc 
>>>> configuration? My PETSc lib was indeed compiled without the debug 
>>>> flags (--with-debugging=0). I use GCC/GFortran (Homebrew GCC 9.2.0) for 
>>>> compiling and building both PETSc and my own Fortran code. My Fortran 
>>>> compiler flags are simply: 
>>>> 
>>>> -O3 -ffree-line-length-none -fastsse 
>>>> 
>>>> which are also used for --FOPTFLAGS in PETSc (I added -openmp for PETSc, but 
>>>> not for my Fortran code, as I don't have any OpenMP optimizations in my 
>>>> program). Note that the performance test results I listed yesterday (e.g. 
>>>> 4.08s with 41 bcgs iterations) exclude any CSR-array->PETSc translation 
>>>> overhead (they only include the set and solve part). 
>>> 
>>>  PETSc doesn't use -openmp in any way for its solvers. Do not use this 
>>> option, it may be slowing the code down. Please send configure.log
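>>> 
>>> (For reference, a reasonable baseline without any OpenMP flags would be 
>>> something like --with-debugging=0 --FOPTFLAGS="-O3 -ffree-line-length-0" 
>>> --COPTFLAGS="-O3" --CXXOPTFLAGS="-O3".)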
>>> 
>>>> 
>>>> I have two questions about the performance difference - 
>>>> 
>>>> 1. Is the ILU factorization computed only once, or is it performed at 
>>>> every outer KSP iteration step? The latter sounds unlikely - but if so, it 
>>>> could add some extra overhead. 
>>> 
>>>  ILU is ONLY done if the matrix has changed (which seems wrong).
>>>> 
>>>> 2. Some KSP solvers like BCGS or TFQMR have two "half-iterations" in each 
>>>> iteration step. I am not sure how it works in PETSc, but is it possible that 
>>>> both "half" relative residuals are counted in the output array, 
>>>> doubling the apparent number of iterations (though that cannot explain the 
>>>> extra time consumed)?
>>> 
>>>  Yes, PETSc might report them as two, you need to check the exact code.
>>> 
>>>> 
>>>> Anyway, the output with -log_view from the same 278906 by 278906 matrix 
>>>> with 3-block D-ILU in PETSc is as follows: 
>>>> 
>>>> 
>>>> ---------------------------------------------- PETSc Performance Summary: 
>>>> ----------------------------------------------
>>>> 
>>>> MEMsolv.lu on a arch-darwin-c-opt named Haos-MBP with 1 processor, by 
>>>> donghao Thu Feb  6 09:07:35 2020
>>>> Using Petsc Release Version 3.12.3, unknown
>>>> 
>>>>                         Max       Max/Min     Avg       Total
>>>> Time (sec):           4.443e+00     1.000   4.443e+00
>>>> Objects:              1.155e+03     1.000   1.155e+03
>>>> Flop:                 4.311e+09     1.000   4.311e+09  4.311e+09
>>>> Flop/sec:             9.703e+08     1.000   9.703e+08  9.703e+08
>>>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>>>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>>>> MPI Reductions:       0.000e+00     0.000
>>>> 
>>>> Flop counting convention: 1 flop = 1 real number operation of type 
>>>> (multiply/divide/add/subtract)
>>>>                            e.g., VecAXPY() for real vectors of length N 
>>>> --> 2N flop
>>>>                            and VecAXPY() for complex vectors of length N 
>>>> --> 8N flop
>>>> 
>>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages 
>>>> ---  -- Message Lengths --  -- Reductions --
>>>>                        Avg     %Total     Avg     %Total    Count   %Total 
>>>>     Avg         %Total    Count   %Total
>>>> 0:      Main Stage: 4.4435e+00 100.0%  4.3113e+09 100.0%  0.000e+00   0.0% 
>>>>  0.000e+00        0.0%  0.000e+00   0.0%
>>>> 
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> See the 'Profiling' chapter of the users' manual for details on 
>>>> interpreting output.
>>>> Phase summary info:
>>>>   Count: number of times phase was executed
>>>>   Time and Flop: Max - maximum over all processors
>>>>                  Ratio - ratio of maximum to minimum over all processors
>>>>   Mess: number of messages sent
>>>>   AvgLen: average message length (bytes)
>>>>   Reduct: number of global reductions
>>>>   Global: entire computation
>>>>   Stage: stages of a computation. Set stages with PetscLogStagePush() and 
>>>> PetscLogStagePop().
>>>>      %T - percent time in this phase         %F - percent flop in this 
>>>> phase
>>>>      %M - percent messages in this phase     %L - percent message lengths 
>>>> in this phase
>>>>      %R - percent reductions in this phase
>>>>   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over 
>>>> all processors)
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event                Count      Time (sec)     Flop                        
>>>>       --- Global ---  --- Stage ----  Total
>>>>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
>>>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> 
>>>> --- Event Stage 0: Main Stage
>>>> 
>>>> BuildTwoSidedF         1 1.0 2.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatMult               83 1.0 1.7815e+00 1.0 2.08e+09 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00 40 48  0  0  0  40 48  0  0  0  1168
>>>> MatSolve             252 1.0 1.2708e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00 29 28  0  0  0  29 28  0  0  0   939
>>>> MatLUFactorNum         3 1.0 7.9725e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  2  1  0  0  0   2  1  0  0  0   298
>>>> MatILUFactorSym        3 1.0 2.6998e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>>> MatAssemblyBegin       5 1.0 3.6000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatAssemblyEnd         5 1.0 3.1619e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>>> MatGetRowIJ            3 1.0 2.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatCreateSubMats       1 1.0 3.9659e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>>> MatGetOrdering         3 1.0 4.3070e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatView                3 1.0 1.3600e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecDot                82 1.0 1.8948e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  4  4  0  0  0   4  4  0  0  0   966
>>>> VecDotNorm2           41 1.0 1.6812e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  4  4  0  0  0   4  4  0  0  0  1088
>>>> VecNorm               43 1.0 9.5099e-02 1.0 9.59e+07 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  2  2  0  0  0   2  2  0  0  0  1009
>>>> VecCopy                2 1.0 1.0120e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecSet               271 1.0 3.8922e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>>> VecAXPY                1 1.0 7.7200e-04 1.0 2.23e+06 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  2890
>>>> VecAXPBYCZ            82 1.0 2.4370e-01 1.0 3.66e+08 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  5  8  0  0  0   5  8  0  0  0  1502
>>>> VecWAXPY              82 1.0 1.4148e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  3  4  0  0  0   3  4  0  0  0  1293
>>>> VecAssemblyBegin       2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecAssemblyEnd         2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecScatterBegin       84 1.0 5.9300e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> KSPSetUp               4 1.0 1.4167e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> KSPSolve               1 1.0 4.0250e+00 1.0 4.31e+09 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00 91100  0  0  0  91100  0  0  0  1071
>>>> PCSetUp                4 1.0 1.5207e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  3  1  0  0  0   3  1  0  0  0   156
>>>> PCSetUpOnBlocks        1 1.0 1.1116e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  3  1  0  0  0   3  1  0  0  0   214
>>>> PCApply               84 1.0 1.2912e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00 29 28  0  0  0  29 28  0  0  0   924
>>>> PCApplyOnBlocks      252 1.0 1.2909e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00 29 28  0  0  0  29 28  0  0  0   924
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> 
>>>> # I skipped the memory part - the options (and compiler options) are as 
>>>> follows: 
>>>> 
>>>> #PETSc Option Table entries: 
>>>> -ksp_type bcgs
>>>> -ksp_view
>>>> -log_view
>>>> -pc_bjacobi_local_blocks 3
>>>> -pc_factor_levels 0
>>>> -pc_sub_type ilu
>>>> -pc_type bjacobi
>>>> #End of PETSc Option Table entries
>>>> Compiled with FORTRAN kernels
>>>> Compiled with full precision matrices (default)
>>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 
>>>> sizeof(PetscScalar) 16 sizeof(PetscInt) 4
>>>> Configure options: --with-scalar-type=complex --download-mumps 
>>>> --download-scalapack --with-fortran-kernels=1 --          FOPTFLAGS=“-O3 
>>>> -fastsse -mp -openmp” --COPTFLAGS=“-O3 -fastsse -mp -openmp” 
>>>> --CXXOPTFLAGS="-O3 -fastsse -mp -openmp" --     with-debugging=0
>>>> -----------------------------------------
>>>> Libraries compiled on 2020-02-03 10:44:31 on Haos-MBP
>>>> Machine characteristics: Darwin-19.2.0-x86_64-i386-64bit
>>>> Using PETSc directory: /Users/donghao/src/git/PETSc-current
>>>> Using PETSc arch: arch-darwin-c-opt
>>>> -----------------------------------------
>>>> 
>>>> Using C compiler: mpicc  -Wall -Wwrite-strings -Wno-strict-aliasing 
>>>> -Wno-unknown-pragmas -fstack-protector -fno-stack-   check 
>>>> -Qunused-arguments -fvisibility=hidden
>>>> Using Fortran compiler: mpif90  -Wall -ffree-line-length-0 
>>>> -Wno-unused-dummy-argument
>>>> 
>>>> Using include paths: -I/Users/donghao/src/git/PETSc-current/include 
>>>> -I/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/include
>>>> -----------------------------------------
>>>> 
>>>> Using C linker: mpicc
>>>> Using Fortran linker: mpif90
>>>> Using libraries: 
>>>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>>>> -L/Users/donghao/src/git/PETSc-   current/arch-darwin-c-opt/lib -lpetsc 
>>>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>>>> -L/Users/    donghao/src/git/PETSc-current/arch-darwin-c-opt/lib 
>>>> -Wl,-rpath,/usr/local/opt/libevent/lib -L/usr/local/opt/libevent/    lib 
>>>> -Wl,-rpath,/usr/local/Cellar/open-mpi/4.0.2/lib 
>>>> -L/usr/local/Cellar/open-mpi/4.0.2/lib -Wl,-rpath,/usr/local/Cellar/ 
>>>> gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-darwin19/9.2.0 
>>>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-        
>>>> darwin19/9.2.0 -Wl,-rpath,/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 
>>>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 -lcmumps -   ldmumps -lsmumps 
>>>> -lzmumps -lmumps_common -lpord -lscalapack -llapack -lblas -lc++ -ldl 
>>>> -lmpi_usempif08 -                 lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi 
>>>> -lgfortran -lquadmath -lm -lc++ -ldl
>>>> 
>>>> 
>>>> On the other hand, running PETSc with 
>>>> 
>>>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -pc_sub_type lu -ksp_type 
>>>> gmres -ksp_monitor -ksp_view -log_view
>>>> 
>>>> for the same problem takes 5.37s and 72 GMRES iterations. Our previous 
>>>> tests show that BiCGstab (bcgs in PETSc) is almost always the most 
>>>> effective KSP solver for our non-symmetric complex system. Strangely, 
>>>> the solver is still using ilu instead of lu for the sub-blocks. The output 
>>>> looks like: 
>>> 
>>> The option is -sub_pc_type lu (note: -sub_pc_type, not -pc_sub_type), which 
>>> is why ilu was still being used on the blocks.
>>> 
>>>> 
>>>>  0 KSP Residual norm 2.480412407430e+02
>>>>  1 KSP Residual norm 8.848059967835e+01
>>>>  2 KSP Residual norm 3.415272863261e+01
>>>>  3 KSP Residual norm 1.563045190939e+01
>>>>  4 KSP Residual norm 6.241296940043e+00
>>>>  5 KSP Residual norm 2.739710899854e+00
>>>>  6 KSP Residual norm 1.391304148888e+00
>>>>  7 KSP Residual norm 7.959262020849e-01
>>>>  8 KSP Residual norm 4.828323055231e-01
>>>>  9 KSP Residual norm 2.918529739200e-01
>>>> 10 KSP Residual norm 1.905508589557e-01
>>>> 11 KSP Residual norm 1.291541892702e-01
>>>> 12 KSP Residual norm 8.827145774707e-02
>>>> 13 KSP Residual norm 6.521331095889e-02
>>>> 14 KSP Residual norm 5.095787952595e-02
>>>> 15 KSP Residual norm 4.043060387395e-02
>>>> 16 KSP Residual norm 3.232590200012e-02
>>>> 17 KSP Residual norm 2.593944982216e-02
>>>> 18 KSP Residual norm 2.064639483533e-02
>>>> 19 KSP Residual norm 1.653916663492e-02
>>>> 20 KSP Residual norm 1.334946415452e-02
>>>> 21 KSP Residual norm 1.092886880597e-02
>>>> 22 KSP Residual norm 8.988004105542e-03
>>>> 23 KSP Residual norm 7.466501315240e-03
>>>> 24 KSP Residual norm 6.284389135436e-03
>>>> 25 KSP Residual norm 5.425231669964e-03
>>>> 26 KSP Residual norm 4.766338253084e-03
>>>> 27 KSP Residual norm 4.241238878242e-03
>>>> 28 KSP Residual norm 3.808113525685e-03
>>>> 29 KSP Residual norm 3.449383788116e-03
>>>> 30 KSP Residual norm 3.126025526388e-03
>>>> 31 KSP Residual norm 2.958328054299e-03
>>>> 32 KSP Residual norm 2.802344900403e-03
>>>> 33 KSP Residual norm 2.621993580492e-03
>>>> 34 KSP Residual norm 2.430066269304e-03
>>>> 35 KSP Residual norm 2.259043079597e-03
>>>> 36 KSP Residual norm 2.104287972986e-03
>>>> 37 KSP Residual norm 1.952916080045e-03
>>>> 38 KSP Residual norm 1.804988937999e-03
>>>> 39 KSP Residual norm 1.643302117377e-03
>>>> 40 KSP Residual norm 1.471661332036e-03
>>>> 41 KSP Residual norm 1.286445911163e-03
>>>> 42 KSP Residual norm 1.127543025848e-03
>>>> 43 KSP Residual norm 9.777148275484e-04
>>>> 44 KSP Residual norm 8.293314450006e-04
>>>> 45 KSP Residual norm 6.989331136622e-04
>>>> 46 KSP Residual norm 5.852307780220e-04
>>>> 47 KSP Residual norm 4.926715539762e-04
>>>> 48 KSP Residual norm 4.215941372075e-04
>>>> 49 KSP Residual norm 3.699489548162e-04
>>>> 50 KSP Residual norm 3.293897163533e-04
>>>> 51 KSP Residual norm 2.959954542998e-04
>>>> 52 KSP Residual norm 2.700193032414e-04
>>>> 53 KSP Residual norm 2.461789791204e-04
>>>> 54 KSP Residual norm 2.218839085563e-04
>>>> 55 KSP Residual norm 1.945154309976e-04
>>>> 56 KSP Residual norm 1.661128781744e-04
>>>> 57 KSP Residual norm 1.413198766258e-04
>>>> 58 KSP Residual norm 1.213984003195e-04
>>>> 59 KSP Residual norm 1.044317450754e-04
>>>> 60 KSP Residual norm 8.919957502977e-05
>>>> 61 KSP Residual norm 8.042584301275e-05
>>>> 62 KSP Residual norm 7.292784493581e-05
>>>> 63 KSP Residual norm 6.481935501872e-05
>>>> 64 KSP Residual norm 5.718564652679e-05
>>>> 65 KSP Residual norm 5.072589750116e-05
>>>> 66 KSP Residual norm 4.487930741285e-05
>>>> 67 KSP Residual norm 3.941040674119e-05
>>>> 68 KSP Residual norm 3.492873281291e-05
>>>> 69 KSP Residual norm 3.103798339845e-05
>>>> 70 KSP Residual norm 2.822943237409e-05
>>>> 71 KSP Residual norm 2.610615023776e-05
>>>> 72 KSP Residual norm 2.441692671173e-05
>>>> KSP Object: 1 MPI processes
>>>>  type: gmres
>>>>    restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization 
>>>> with no iterative refinement
>>>>    happy breakdown tolerance 1e-30
>>>>  maximum iterations=150, nonzero initial guess
>>>>  tolerances:  relative=1e-07, absolute=1e-50, divergence=10000.
>>>>  left preconditioning
>>>>  using PRECONDITIONED norm type for convergence test
>>>> PC Object: 1 MPI processes
>>>>  type: bjacobi
>>>>    number of blocks = 3
>>>>    Local solve is same for all blocks, in the following KSP and PC objects:
>>>>    KSP Object: (sub_) 1 MPI processes
>>>>      type: preonly
>>>>      maximum iterations=10000, initial guess is zero
>>>>      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
>>>>      left preconditioning
>>>>      using NONE norm type for convergence test
>>>>    PC Object: (sub_) 1 MPI processes
>>>>      type: ilu
>>>>        out-of-place factorization
>>>>        0 levels of fill
>>>>        tolerance for zero pivot 2.22045e-14
>>>>        matrix ordering: natural
>>>>        factor fill ratio given 1., needed 1.
>>>>          Factored matrix follows:
>>>>            Mat Object: 1 MPI processes
>>>>              type: seqaij
>>>>              rows=92969, cols=92969
>>>>              package used to perform factorization: petsc
>>>>              total: nonzeros=638417, allocated nonzeros=638417
>>>>              total number of mallocs used during MatSetValues calls=0
>>>>                not using I-node routines
>>>>      linear system matrix = precond matrix:
>>>>      Mat Object: 1 MPI processes
>>>>        type: seqaij
>>>>        rows=92969, cols=92969
>>>>        total: nonzeros=638417, allocated nonzeros=638417
>>>>        total number of mallocs used during MatSetValues calls=0
>>>>          not using I-node routines
>>>>  linear system matrix = precond matrix:
>>>>  Mat Object: 1 MPI processes
>>>>    type: mpiaij
>>>>    rows=278906, cols=278906
>>>>    total: nonzeros=3274027, allocated nonzeros=3274027
>>>>    total number of mallocs used during MatSetValues calls=0
>>>>      not using I-node (on process 0) routines
>>>> ...
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event                Count      Time (sec)     Flop                        
>>>>       --- Global ---  --- Stage ----  Total
>>>>                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
>>>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> 
>>>> --- Event Stage 0: Main Stage
>>>> 
>>>> BuildTwoSidedF         1 1.0 2.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatMult               75 1.0 1.5812e+00 1.0 1.88e+09 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00 28 24  0  0  0  28 24  0  0  0  1189
>>>> MatSolve             228 1.0 1.1442e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00 20 14  0  0  0  20 14  0  0  0   944
>>> 
>>>  These flop rates are ok, but not great. Perhaps an older machine.
>>> 
>>>> MatLUFactorNum         3 1.0 8.1930e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0   290
>>>> MatILUFactorSym        3 1.0 2.7102e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatAssemblyBegin       5 1.0 3.7000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatAssemblyEnd         5 1.0 3.1895e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>>> MatGetRowIJ            3 1.0 2.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatCreateSubMats       1 1.0 4.0904e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
>>>> MatGetOrdering         3 1.0 4.2640e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> MatView                3 1.0 1.4400e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecMDot               72 1.0 1.1984e+00 1.0 2.25e+09 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00 21 28  0  0  0  21 28  0  0  0  1877
>>> 
>>>   21 percent of the time in VecMDot - this is huge for a sequential run. I 
>>> think maybe you are using a terrible OpenMP BLAS? 
>>> 
>>>   Send configure.log 
>>> 
>>> 
>>>> VecNorm               76 1.0 1.6841e-01 1.0 1.70e+08 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  3  2  0  0  0   3  2  0  0  0  1007
>>>> VecScale              75 1.0 1.8241e-02 1.0 8.37e+07 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  1  0  0  0   0  1  0  0  0  4587
>>>> VecCopy                3 1.0 1.4970e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecSet               276 1.0 9.1970e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
>>>> VecAXPY                6 1.0 3.7450e-03 1.0 1.34e+07 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0  3575
>>>> VecMAXPY              75 1.0 1.0022e+00 1.0 2.41e+09 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00 18 30  0  0  0  18 30  0  0  0  2405
>>>> VecAssemblyBegin       2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecAssemblyEnd         2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecScatterBegin       76 1.0 5.5100e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> VecNormalize          75 1.0 1.8462e-01 1.0 2.51e+08 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  3  3  0  0  0   3  3  0  0  0  1360
>>>> KSPSetUp               4 1.0 1.1341e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>>>> KSPSolve               1 1.0 5.3123e+00 1.0 7.91e+09 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00 93100  0  0  0  93100  0  0  0  1489
>>>> KSPGMRESOrthog        72 1.0 2.1316e+00 1.0 4.50e+09 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00 37 57  0  0  0  37 57  0  0  0  2110
>>>> PCSetUp                4 1.0 1.5531e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  3  0  0  0  0   3  0  0  0  0   153
>>>> PCSetUpOnBlocks        1 1.0 1.1343e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00  2  0  0  0  0   2  0  0  0  0   209
>>>> PCApply               76 1.0 1.1671e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00 20 14  0  0  0  20 14  0  0  0   925
>>>> PCApplyOnBlocks      228 1.0 1.1668e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 
>>>> 0.0e+00 20 14  0  0  0  20 14  0  0  0   925
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> ...
>>>> #PETSc Option Table entries:
>>>> -ksp_monitor
>>>> -ksp_type gmres
>>>> -ksp_view
>>>> -log_view
>>>> -pc_bjacobi_local_blocks 3
>>>> -pc_sub_type lu
>>>> -pc_type bjacobi
>>>> #End of PETSc Option Table entries
>>>> ...
>>>> 
>>>> Does any of the setup/output ring a bell? 
>>>> 
>>>> BTW, out of curiosity - what is an "I-node" routine?
>>>> 
>>>> 
>>>> Cheers, 
>>>> Hao
>>>> 
>>>> 
>>>> From: Smith, Barry F. <[email protected]>
>>>> Sent: Wednesday, February 5, 2020 9:42 PM
>>>> To: Hao DONG <[email protected]>
>>>> Cc: [email protected] <[email protected]>
>>>> Subject: Re: [petsc-users] What is the right way to implement a (block) 
>>>> Diagonal ILU as PC?
>>>> 
>>>> 
>>>> 
>>>>> On Feb 5, 2020, at 4:36 AM, Hao DONG <[email protected]> wrote:
>>>>> 
>>>>> Thanks a lot for your suggestions, Hong and Barry - 
>>>>> 
>>>>> As you suggested, I first tried out the LU direct solvers (built-in and 
>>>>> MUMPS) this morning, which work perfectly, albeit slowly. As my problem 
>>>>> is part of a PDE-based optimization, an exact solution is not necessary 
>>>>> during the search procedure (I usually set a relative tolerance of 
>>>>> 1E-7/1E-8, etc. for Krylov subspace methods). I also tried BJACOBI with 
>>>>> exact LU; the KSP converges in just one or two iterations, as expected. 
>>>>> 
>>>>> I added the -ksp_view option for the D-ILU test (still with block Jacobi as 
>>>>> the preconditioner and bcgs as the KSP solver). The -ksp_view output from 
>>>>> one of the examples in a toy model looks like: 
>>>>> 
>>>>> KSP Object: 1 MPI processes
>>>>>  type: bcgs
>>>>>  maximum iterations=120, nonzero initial guess
>>>>>  tolerances:  relative=1e-07, absolute=1e-50, divergence=10000.
>>>>>  left preconditioning
>>>>>  using PRECONDITIONED norm type for convergence test
>>>>> PC Object: 1 MPI processes
>>>>>  type: bjacobi
>>>>>    number of blocks = 3
>>>>>    Local solve is same for all blocks, in the following KSP and PC 
>>>>> objects:
>>>>>    KSP Object: (sub_) 1 MPI processes
>>>>>      type: preonly
>>>>>      maximum iterations=10000, initial guess is zero
>>>>>      tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
>>>>>      left preconditioning
>>>>>      using NONE norm type for convergence test
>>>>>    PC Object: (sub_) 1 MPI processes
>>>>>      type: ilu
>>>>>        out-of-place factorization
>>>>>        0 levels of fill
>>>>>        tolerance for zero pivot 2.22045e-14
>>>>>        matrix ordering: natural
>>>>>        factor fill ratio given 1., needed 1.
>>>>>          Factored matrix follows:
>>>>>            Mat Object: 1 MPI processes
>>>>>              type: seqaij
>>>>>              rows=11294, cols=11294
>>>>>              package used to perform factorization: petsc
>>>>>              total: nonzeros=76008, allocated nonzeros=76008
>>>>>              total number of mallocs used during MatSetValues calls=0
>>>>>                not using I-node routines
>>>>>      linear system matrix = precond matrix:
>>>>>      Mat Object: 1 MPI processes
>>>>>        type: seqaij
>>>>>        rows=11294, cols=11294
>>>>>        total: nonzeros=76008, allocated nonzeros=76008
>>>>>        total number of mallocs used during MatSetValues calls=0
>>>>>          not using I-node routines
>>>>>  linear system matrix = precond matrix:
>>>>>  Mat Object: 1 MPI processes
>>>>>    type: mpiaij
>>>>>    rows=33880, cols=33880
>>>>>    total: nonzeros=436968, allocated nonzeros=436968
>>>>>    total number of mallocs used during MatSetValues calls=0
>>>>>      not using I-node (on process 0) routines
>>>>> 
>>>>> Do you see something wrong with my setup?
>>>>> 
>>>>> I also tried a quick performance test with a small 278906 by 278906 
>>>>> matrix (3850990 nnz) with the following parameters: 
>>>>> 
>>>>> -ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 -pc_sub_type 
>>>>> ilu -ksp_view
>>>>> 
>>>>> Reducing the relative residual to 1E-7 
>>>>> 
>>>>> Took 4.08s with 41 bcgs iterations. 
>>>>> 
>>>>> Merely changing the -pc_bjacobi_local_blocks to 6 
>>>>> 
>>>>> Took 7.02s with 73 bcgs iterations; 9 blocks further takes 9.45s 
>>>>> with 101 bcgs iterations.
>>>> 
>>>>   This is normal; more blocks means slower convergence.
>>>>> 
>>>>> As a reference, my home-brew Fortran code solves the same problem 
>>>>> (3-block D-ILU0) in 
>>>>> 
>>>>> 1.84s with 24 bcgs iterations (the bcgs code is also a home-brew one)…
>>>>> 
>>>>   Run the PETSc code with optimization (./configure --with-debugging=0) and 
>>>> run the code with -log_view; this will show where the PETSc code is 
>>>> spending the time (send it to us).
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> Well, by saying "use an explicit L/U matrix as preconditioner", I wonder if 
>>>>> a user is allowed to provide his own (separate) L and U Mats for 
>>>>> preconditioning (like in the Matlab solvers), e.g.
>>>>> 
>>>>> x = qmr(A,b,Tol,MaxIter,L,U,x)
>>>>> 
>>>>> As I have already explicitly constructed the L and U matrices in Fortran, it 
>>>>> is not hard to convert them to Mats in PETSc to test PETSc and my Fortran 
>>>>> code head-to-head. In that case the A, b, x, and L/U are all identical, 
>>>>> so it would be easier to see where the problem comes from. 
>>>>> 
>>>>> 
>>>>    No, we don't provide this kind of support
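>>>> 
>>>>    If you really want to drive your own Fortran-built factors from PETSc, 
>>>> PCSHELL is the usual escape hatch. Below is a minimal, untested sketch that 
>>>> reuses the ksp_local/pc_local names from your snippet; the apply body is 
>>>> only a placeholder you would replace with your own forward/backward 
>>>> substitutions, and it assumes #include <petsc/finclude/petscksp.h> at the 
>>>> top of the source file. 
>>>> 
>>>>      ! register the shell preconditioner on the existing KSP
>>>>      external MyDILUApply
>>>>      call KSPGetPC(ksp_local, pc_local, ierr)
>>>>      call PCSetType(pc_local, PCSHELL, ierr)
>>>>      call PCShellSetApply(pc_local, MyDILUApply, ierr)
>>>> 
>>>>      ! the apply callback: y = M^{-1} x
>>>>      subroutine MyDILUApply(pc, x, y, ierr)
>>>>      use petscksp
>>>>      implicit none
>>>>      PC             pc
>>>>      Vec            x, y
>>>>      PetscErrorCode ierr
>>>>      ! placeholder: y = x; replace with your own D-ILU triangular
>>>>      ! solves on the L/U blocks you assembled in Fortran
>>>>      call VecCopy(x, y, ierr)
>>>>      end subroutine MyDILUApply
>>>> 
>>>>    The Fortran PCSHELL tutorial ex15f in the PETSc KSP examples shows the 
>>>> complete pattern, including setup and context handling.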
>>>> 
>>>> 
>>>>> 
>>>>> BTW, there is another thing I wondered about - is there a way to output the 
>>>>> residual in the unpreconditioned norm? I tried
>>>>> 
>>>>> call KSPSetNormType(ksp_local, KSP_NORM_UNPRECONDITIONED, ierr)
>>>>> 
>>>>> but I always get an error that the current KSP does not support the 
>>>>> unpreconditioned norm with LEFT/RIGHT preconditioning (either way). Is it 
>>>>> possible to output the unpreconditioned residual in PETSc at all?
>>>> 
>>>>  Use -ksp_monitor_true_residual. You can also run GMRES (and some other 
>>>> methods) with right preconditioning, -ksp_pc_side right; then the residual 
>>>> computed by the algorithm is the unpreconditioned residual.
>>>> 
>>>>  KSPSetNormType sets the norm used in the algorithm; it generally has to be 
>>>> left or right, and only a couple of algorithms support the unpreconditioned 
>>>> norm directly.
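>>>> 
>>>>  For example, running with -ksp_type gmres -ksp_pc_side right 
>>>> -ksp_monitor_true_residual should print both the preconditioned and the 
>>>> true (unpreconditioned) residual norms at each iteration.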
>>>> 
>>>>  Barry
>>>> 
>>>> 
>>>>> 
>>>>> Cheers, 
>>>>> Hao
>>>>> 
>>>>> 
>>>>> From: Smith, Barry F. <[email protected]>
>>>>> Sent: Tuesday, February 4, 2020 8:27 PM
>>>>> To: Hao DONG <[email protected]>
>>>>> Cc: [email protected] <[email protected]>
>>>>> Subject: Re: [petsc-users] What is the right way to implement a (block) 
>>>>> Diagonal ILU as PC?
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Feb 4, 2020, at 12:41 PM, Hao DONG <[email protected]> wrote:
>>>>>> 
>>>>>> Dear all, 
>>>>>> 
>>>>>> 
>>>>>> I have a few questions about the implementation of a diagonal ILU PC in 
>>>>>> PETSc. I want to solve a very simple system with KSP (in parallel); the 
>>>>>> nature of the system (finite-difference time-harmonic Maxwell) is 
>>>>>> probably not important to the question itself. Long story short, I just 
>>>>>> need to solve a set of equations Ax = b with a block-diagonal system 
>>>>>> matrix, like (not sure if the mono font works): 
>>>>>> 
>>>>>>       | X     |
>>>>>>  A =  |   Y   |
>>>>>>       |     Z |
>>>>>> 
>>>>>> Note that A is not really block-diagonal; it's just a multi-diagonal 
>>>>>> matrix determined by the FD mesh, where most elements are close to the 
>>>>>> diagonal. So instead of a full ILU decomposition, a D-ILU is a good 
>>>>>> approximation as a preconditioner, and the number of blocks should not 
>>>>>> matter too much: 
>>>>>> 
>>>>>>       | Lx       |            | Ux       |
>>>>>>  L =  |    Ly    |  and  U =  |    Uy    |
>>>>>>       |       Lz |            |       Uz |
>>>>>> 
>>>>>> Where [Lx, Ux] = ILU0(X), etc. Indeed, the D-ILU preconditioner (with 3N 
>>>>>> blocks) is quite efficient with Krylov subspace methods like BiCGstab or 
>>>>>> QMR in my sequential Fortran/Matlab code. 
>>>>>> 
>>>>>> So, like most users, I am looking for a parallel implementation of this 
>>>>>> in PETSc. After looking through the manual, it seems that the 
>>>>>> most straightforward way to do it is through PCBJACOBI. Not sure I 
>>>>>> understand it right: I just set up a 3-block PCBJACOBI and gave each of 
>>>>>> the blocks a KSP with PCILU. Is this supposed to be equivalent to my 
>>>>>> D-ILU preconditioner? My little block of Fortran code looks like: 
>>>>>> ...
>>>>>>     call PCBJacobiSetTotalBlocks(pc_local,Ntotal,                   &
>>>>>>    &     isubs,ierr)
>>>>>>     call PCBJacobiSetLocalBlocks(pc_local, Nsub,                    &
>>>>>>    &    isubs(istart:iend),ierr)
>>>>>>     ! set up the block jacobi structure
>>>>>>     call KSPSetup(ksp_local,ierr)
>>>>>>     ! allocate sub ksps
>>>>>>     allocate(ksp_sub(Nsub))
>>>>>>     call PCBJacobiGetSubKSP(pc_local,Nsub,istart,                   &
>>>>>>    &     ksp_sub,ierr)
>>>>>>     do i=1,Nsub
>>>>>>         call KSPGetPC(ksp_sub(i),pc_sub,ierr)
>>>>>>         !ILU preconditioner
>>>>>>         call PCSetType(pc_sub,ptype,ierr)
>>>>>>         call PCFactorSetLevels(pc_sub,1,ierr) ! use ILU(1) here
>>>>>>         call KSPSetType(ksp_sub(i),KSPPREONLY,ierr)
>>>>>>     end do
>>>>>>     call KSPSetTolerances(ksp_local,KSSiter%tol,PETSC_DEFAULT_REAL, &
>>>>>>    &     PETSC_DEFAULT_REAL,KSSiter%maxit,ierr)
>>>>>> … 
>>>>> 
>>>>>    This code looks essentially right. You should run with -ksp_view to 
>>>>> see exactly what PETSc is using for the solver. 
>>>>> 
>>>>>> 
>>>>>> I understand that the parallel performance may not be comparable, so I 
>>>>>> first set up a one-process test (with MPIAIJ, but all the rows are local 
>>>>>> since there is only one process). The system is solved without any 
>>>>>> problem (identical results within error). But the performance is 
>>>>>> actually a lot worse (the code was built without debugging flags for the 
>>>>>> performance tests) than my own home-brew Fortran implementation (I wrote 
>>>>>> my own ILU0 in CSR sparse matrix format), which is hard to believe. I 
>>>>>> suspect the difference comes from the PC, as the PETSc version took many 
>>>>>> more BiCGstab iterations (60-ish vs 100-ish) to converge to the same 
>>>>>> relative tolerance. 
>>>>> 
>>>>>  PETSc uses GMRES by default, with a restart of 30 and left 
>>>>> preconditioning. Different exact choices in the solver (which 
>>>>> is why -ksp_view is so useful) can explain the differences in the runs 
>>>>> between your code and PETSc's.
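>>>>> 
>>>>>  For example, to mirror your own BiCGstab setup you could run with 
>>>>> something like -ksp_type bcgs -ksp_rtol 1e-7 -ksp_max_it 120 -ksp_view 
>>>>> (adjusting the tolerance and iteration cap to whatever your Fortran code 
>>>>> actually uses), which makes the two solvers directly comparable.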
>>>>>> 
>>>>>> This is further confirmed when I change the setup of the D-ILU (using 6 or 
>>>>>> 9 blocks instead of 3). While my Fortran/Matlab codes see minimal 
>>>>>> performance difference (<5%) when I play with the D-ILU setup, 
>>>>>> increasing the number of D-ILU blocks from 3 to 9 causes the KSP setup 
>>>>>> with PCBJACOBI to suffer a performance decrease of more than 50% in the 
>>>>>> sequential test.
>>>>> 
>>>>>  This is odd: the more blocks, the smaller each block, so the quicker the 
>>>>> ILU setup should be. You can run the various cases with -log_view and send 
>>>>> us the output to see where each part of the computation is spending its 
>>>>> time.
>>>>> 
>>>>>> So my implementation IS somewhat different in PETSc. Did I miss something 
>>>>>> in the PCBJACOBI setup? Or do I have some fundamental misunderstanding 
>>>>>> of how PCBJACOBI works in PETSc? 
>>>>> 
>>>>>  Probably not.
>>>>>> 
>>>>>> If this is not the right way to implement a block-diagonal ILU as a 
>>>>>> (parallel) PC, please kindly point me in the right direction. I searched 
>>>>>> through the mailing list for answers, only to find a couple of 
>>>>>> similar questions... An example would be nice.
>>>>> 
>>>>>  Your approach seems fundamentally right, but I cannot rule out possible 
>>>>> bugs.
>>>>>> 
>>>>>> On the other hand, does PETSc support a simple way to use an explicit L/U 
>>>>>> matrix as a preconditioner? I could import the D-ILU matrices 
>>>>>> constructed in my Fortran code (I have already converted my A matrix into 
>>>>>> a Mat) to make a better comparison. Or do I have to construct my own PC 
>>>>>> using PCSHELL? If so, is there a good tutorial/example for learning how to 
>>>>>> use PCSHELL (in a more advanced sense, like how to set up the PC side and 
>>>>>> the transpose)? 
>>>>> 
>>>>>  Not sure what you mean by explicit L/U matrix as a preconditioner. As 
>>>>> Hong said, yes you can use a parallel LU from MUMPS or SuperLU_DIST or 
>>>>> Pastix as the solver. You do not need any shell code. You simply need to 
>>>>> set the PCType to lu 
>>>>> 
>>>>>  You can also set all these options from the command line, without 
>>>>> hardwiring them in the code. So call KSPSetFromOptions() and then, for 
>>>>> example,
>>>>> 
>>>>>   -pc_type bjacobi  -pc_bjacobi_local_blocks 3 -pc_sub_type ilu (this 
>>>>> last one is applied to each block so you could use -pc_type lu and it 
>>>>> would use lu on each block.) 
>>>>> 
>>>>>  -ksp_type preonly  -pc_type lu -pc_factor_mat_solver_type mumps  (do a 
>>>>> parallel LU with MUMPS)
>>>>> 
>>>>> By not hardwiring choices in the code and just using options, you can test 
>>>>> out different cases much more quickly.
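>>>>> 
>>>>>  For instance (a minimal sketch, reusing the ksp_local handle from your 
>>>>> snippet and leaving all solver choices to the options database):
>>>>> 
>>>>>      call KSPSetFromOptions(ksp_local, ierr)
>>>>>      call KSPSolve(ksp_local, b, x, ierr)
>>>>> 
>>>>>  and then run with, e.g., 
>>>>> -ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu 
>>>>> -ksp_view -log_view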
>>>>> 
>>>>> Use -ksp_view to make sure that is using the solver the way you expect.
>>>>> 
>>>>> Barry
>>>>> 
>>>>> 
>>>>> 
>>>>>  Barry
>>>>> 
>>>>>> 
>>>>>> Thanks in advance, 
>>>>>> 
>>>>>> Hao
>>> 
>>> <configure.log><configure.log>
>> 
> 
