You should google for preconditioners/solvers for the 3D time-harmonic Maxwell
equations arising from staggered-grid finite differences. Maxwell has its own
particular structure, and its difficulties with iterative solvers depend on the
parameters. Also see page 87 in
https://www.mcs.anl.gov/petsc/petsc-current/docs/manual.pdf
Barry
-pc_type asm may work well for a handful of processors.
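
For example, something along these lines is a reasonable starting point to
experiment with (the overlap and the sub-solver here are just first guesses to
adjust, not a recommendation):

  -ksp_type bcgs -pc_type asm -pc_asm_overlap 1 -sub_pc_type ilu -ksp_monitor_true_residual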
> On Feb 10, 2020, at 8:47 AM, Hao DONG <[email protected]> wrote:
>
> Hi Barry,
>
> Thank you for your suggestions (and insights)! Indeed my initial motivation to
> try out PETSc is the variety of methods it offers. As my matrix pattern is
> relatively simple (3D time-harmonic Maxwell equations arising from a
> staggered-grid finite difference discretization), and also considering the fact
> that I am not wealthy enough to utilize the direct solvers, I was looking for a
> fast Krylov subspace method / preconditioner that scales well with, say, tens of CPU cores.
>
> As a simple block-Jacobi preconditioner seems to lose its efficiency with
> more than a handful of blocks, I plan to look into other
> methods/preconditioners, e.g. multigrid (as a preconditioner) or domain
> decomposition methods. But I will probably need to read a fair amount of
> literature before laying my hands on those (or bothering you with more
> questions!). Anyway, thanks again for your kind help.
>
>
> All the best,
> Hao
>
>> On Feb 8, 2020, at 8:02 AM, Smith, Barry F. <[email protected]> wrote:
>>
>>
>>
>>> On Feb 7, 2020, at 7:44 AM, Hao DONG <[email protected]> wrote:
>>>
>>> Thanks, Barry, I really appreciate your help -
>>>
>>> I removed the OpenMP flags and rebuilt PETSc. So the problem is from the
>>> BLAS lib I linked?
>>
>> Yes, the OpenMP flag causes the BLAS to run in parallel, but the problem is not big
>> enough and the machine is not good enough for parallel BLAS to speed things
>> up; instead it slows things down a lot. We see this often: parallel BLAS
>> must be used with care.
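>>
>> (A quick way to check, assuming the BLAS is OpenMP-threaded, is to set the
>> environment variable OMP_NUM_THREADS=1 before running; if the timings
>> improve, the threaded BLAS was the culprit.)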
>>
>>> Not sure which version my BLAS is, though… But I also included the
>>> --download-scalapack option. Shouldn't that enable linking with PBLAS
>>> automatically?
>>>
>>> After looking at the bcgs code in PETSc, I suppose the residual is indeed
>>> recorded twice per "actual iteration". That explains the difference in
>>> iteration counts.
>>>
>>> My laptop is indeed an old machine (MBP15 mid-2014). I just cannot work
>>> with vi without a physical ESC key...
>>
>> The latest has a physical ESC; I am stuck without the ESC for a couple more
>> years.
>>
>>> I have attached the configure.log - didn't know that it is so large!
>>>
>>> Anyway, it seems that removing -openmp changes quite a lot of things:
>>> the performance is indeed much better - the flop/sec increases by a
>>> factor of 3. Still, I am spending 20 percent of the time in VecMDot with
>>> GMRES, but there is no VecMDot in BCGS at all (see output below) - is that
>>> a feature of the GMRES method?
>>
>> Yes, GMRES orthogonalizes against the last restart directions, which uses
>> these routines, while BCGS does not; this is why BCGS is cheaper per
>> iteration.
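>>
>> (If the orthogonalization cost bothers you, a smaller restart, e.g.
>> -ksp_gmres_restart 20, reduces the VecMDot/VecMAXPY work per iteration,
>> possibly at the price of more iterations - just something to experiment with.)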
>>
>> PETSc is no faster than your code because the algorithm is the same, the
>> compilers are the same, and the hardware is the same. There is no way for
>> PETSc to have clever tricks that make it much faster. What PETSc provides is
>> a huge variety of tested algorithms that no single person could code on
>> their own. Anything in PETSc you could code yourself, given endless time,
>> and get basically the same performance.
>>
>> Barry
>>
>>
>>>
>>> here is the output of the same problem with:
>>>
>>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu -ksp_type
>>> gmres -ksp_monitor -ksp_view
>>>
>>>
>>> ---------------------------------------------- PETSc Performance Summary:
>>> ----------------------------------------------
>>>
>>> Mod3DMT.test on a arch-darwin-c-opt named Haos-MBP with 1 processor, by
>>> donghao Fri Feb 7 10:26:19 2020
>>> Using Petsc Release Version 3.12.3, unknown
>>>
>>> Max Max/Min Avg Total
>>> Time (sec): 2.520e+00 1.000 2.520e+00
>>> Objects: 1.756e+03 1.000 1.756e+03
>>> Flop: 7.910e+09 1.000 7.910e+09 7.910e+09
>>> Flop/sec: 3.138e+09 1.000 3.138e+09 3.138e+09
>>> MPI Messages: 0.000e+00 0.000 0.000e+00 0.000e+00
>>> MPI Message Lengths: 0.000e+00 0.000 0.000e+00 0.000e+00
>>> MPI Reductions: 0.000e+00 0.000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type
>>> (multiply/divide/add/subtract)
>>> e.g., VecAXPY() for real vectors of length N -->
>>> 2N flop
>>> and VecAXPY() for complex vectors of length N
>>> --> 8N flop
>>>
>>> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages ---
>>> -- Message Lengths -- -- Reductions --
>>> Avg %Total Avg %Total Count %Total
>>> Avg %Total Count %Total
>>> 0: Main Stage: 2.5204e+00 100.0% 7.9096e+09 100.0% 0.000e+00 0.0%
>>> 0.000e+00 0.0% 0.000e+00 0.0%
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> …
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event Count Time (sec) Flop
>>> --- Global --- --- Stage ---- Total
>>> Max Ratio Max Ratio Max Ratio Mess AvgLen
>>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> BuildTwoSidedF 1 1.0 3.4000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatMult 75 1.0 6.2884e-01 1.0 1.88e+09 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 25 24 0 0 0 25 24 0 0 0 2991
>>> MatSolve 228 1.0 4.4164e-01 1.0 1.08e+09 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 18 14 0 0 0 18 14 0 0 0 2445
>>> MatLUFactorNum 3 1.0 4.1317e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 2 0 0 0 0 2 0 0 0 0 574
>>> MatILUFactorSym 3 1.0 2.3858e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> MatAssemblyBegin 5 1.0 4.4000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatAssemblyEnd 5 1.0 1.5067e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> MatGetRowIJ 3 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatCreateSubMats 1 1.0 2.4558e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> MatGetOrdering 3 1.0 1.3290e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatView 3 1.0 1.2800e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecMDot 72 1.0 4.9875e-01 1.0 2.25e+09 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 20 28 0 0 0 20 28 0 0 0 4509
>>> VecNorm 76 1.0 6.6666e-02 1.0 1.70e+08 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 3 2 0 0 0 3 2 0 0 0 2544
>>> VecScale 75 1.0 1.7982e-02 1.0 8.37e+07 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 1 1 0 0 0 1 1 0 0 0 4653
>>> VecCopy 3 1.0 1.5080e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecSet 276 1.0 9.6784e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 4 0 0 0 0 4 0 0 0 0 0
>>> VecAXPY 6 1.0 3.6860e-03 1.0 1.34e+07 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 3632
>>> VecMAXPY 75 1.0 4.0490e-01 1.0 2.41e+09 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 16 30 0 0 0 16 30 0 0 0 5951
>>> VecAssemblyBegin 2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecAssemblyEnd 2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecScatterBegin 76 1.0 5.3800e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecNormalize 75 1.0 8.3690e-02 1.0 2.51e+08 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 3 3 0 0 0 3 3 0 0 0 2999
>>> KSPSetUp 4 1.0 1.1663e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> KSPSolve 1 1.0 2.2119e+00 1.0 7.91e+09 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 88100 0 0 0 88100 0 0 0 3576
>>> KSPGMRESOrthog 72 1.0 8.7843e-01 1.0 4.50e+09 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 35 57 0 0 0 35 57 0 0 0 5121
>>> PCSetUp 4 1.0 9.2448e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 4 0 0 0 0 4 0 0 0 0 257
>>> PCSetUpOnBlocks 1 1.0 6.6597e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 3 0 0 0 0 3 0 0 0 0 356
>>> PCApply 76 1.0 4.6281e-01 1.0 1.08e+09 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 18 14 0 0 0 18 14 0 0 0 2333
>>> PCApplyOnBlocks 228 1.0 4.6262e-01 1.0 1.08e+09 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 18 14 0 0 0 18 14 0 0 0 2334
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> Average time to get PetscTime(): 1e-07
>>> #PETSc Option Table entries:
>>> -I LBFGS
>>> -ksp_type gmres
>>> -ksp_view
>>> -log_view
>>> -pc_bjacobi_local_blocks 3
>>> -pc_type bjacobi
>>> -sub_pc_type ilu
>>> #End of PETSc Option Table entries
>>> Compiled with FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>> sizeof(PetscScalar) 16 sizeof(PetscInt) 4
>>> Configure options: --with-scalar-type=complex --download-mumps
>>> --download-scalapack --with-fortran-kernels=1 --FOPTFLAGS="-O3
>>> -ffree-line-length-0 -msse2" --COPTFLAGS="-O3 -msse2" --CXXOPTFLAGS="-O3
>>> -msse2" --with-debugging=0
>>> -----------------------------------------
>>> Libraries compiled on 2020-02-07 10:07:42 on Haos-MBP
>>> Machine characteristics: Darwin-19.3.0-x86_64-i386-64bit
>>> Using PETSc directory: /Users/donghao/src/git/PETSc-current
>>> Using PETSc arch: arch-darwin-c-opt
>>> -----------------------------------------
>>>
>>> Using C compiler: mpicc -Wall -Wwrite-strings -Wno-strict-aliasing
>>> -Wno-unknown-pragmas -fstack-protector -fno-stack-check
>>> -Qunused-arguments -fvisibility=hidden -O3 -msse2
>>> Using Fortran compiler: mpif90 -Wall -ffree-line-length-0
>>> -Wno-unused-dummy-argument -O3 -ffree-line-length-0 -msse2
>>> -----------------------------------------
>>>
>>> Using include paths: -I/Users/donghao/src/git/PETSc-current/include
>>> -I/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/include
>>> -----------------------------------------
>>>
>>> Using C linker: mpicc
>>> Using Fortran linker: mpif90
>>> Using libraries:
>>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib
>>> -L/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib -lpetsc
>>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib
>>> -L/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib
>>> -Wl,-rpath,/usr/local/opt/libevent/lib -L/usr/local/opt/libevent/lib
>>> -Wl,-rpath,/usr/local/Cellar/open-mpi/4.0.2/lib
>>> -L/usr/local/Cellar/open-mpi/4.0.2/lib -Wl,-rpath,/usr/local/Cellar/
>>> gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-darwin19/9.2.0
>>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-
>>> darwin19/9.2.0 -Wl,-rpath,/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9
>>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 -lcmumps -ldmumps -lsmumps
>>> -lzmumps -lmumps_common -lpord -lscalapack -llapack -lblas -lc++ -ldl
>>> -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi
>>> -lgfortran -lquadmath -lm -lc++ -ldl
>>> -----------------------------------------
>>>
>>>
>>>
>>> The BCGS solver performance is now comparable to my own Fortran code
>>> (1.84s). Still, I feel that there is something wrong hidden somewhere in my
>>> setup - a professional library should perform better, I believe. Any other
>>> ideas that I can look into? Interestingly, there is no VecMDot operation at
>>> all! Here is the output with the options:
>>>
>>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu -ksp_type bcgs
>>> -ksp_monitor -ksp_view
>>>
>>>
>>>
>>>
>>> ---------------------------------------------- PETSc Performance Summary:
>>> ----------------------------------------------
>>>
>>> Mod3DMT.test on a arch-darwin-c-opt named Haos-MBP with 1 processor, by
>>> donghao Fri Feb 7 10:38:00 2020
>>> Using Petsc Release Version 3.12.3, unknown
>>>
>>> Max Max/Min Avg Total
>>> Time (sec): 2.187e+00 1.000 2.187e+00
>>> Objects: 1.155e+03 1.000 1.155e+03
>>> Flop: 4.311e+09 1.000 4.311e+09 4.311e+09
>>> Flop/sec: 1.971e+09 1.000 1.971e+09 1.971e+09
>>> MPI Messages: 0.000e+00 0.000 0.000e+00 0.000e+00
>>> MPI Message Lengths: 0.000e+00 0.000 0.000e+00 0.000e+00
>>> MPI Reductions: 0.000e+00 0.000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type
>>> (multiply/divide/add/subtract)
>>> e.g., VecAXPY() for real vectors of length N -->
>>> 2N flop
>>> and VecAXPY() for complex vectors of length N
>>> --> 8N flop
>>>
>>> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages ---
>>> -- Message Lengths -- -- Reductions --
>>> Avg %Total Avg %Total Count %Total
>>> Avg %Total Count %Total
>>> 0: Main Stage: 2.1870e+00 100.0% 4.3113e+09 100.0% 0.000e+00 0.0%
>>> 0.000e+00 0.0% 0.000e+00 0.0%
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event Count Time (sec) Flop
>>> --- Global --- --- Stage ---- Total
>>> Max Ratio Max Ratio Max Ratio Mess AvgLen
>>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> BuildTwoSidedF 1 1.0 2.2000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatMult 83 1.0 7.8726e-01 1.0 2.08e+09 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 36 48 0 0 0 36 48 0 0 0 2644
>>> MatSolve 252 1.0 5.5656e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 25 28 0 0 0 25 28 0 0 0 2144
>>> MatLUFactorNum 3 1.0 4.5115e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 2 1 0 0 0 2 1 0 0 0 526
>>> MatILUFactorSym 3 1.0 2.5103e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> MatAssemblyBegin 5 1.0 3.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatAssemblyEnd 5 1.0 1.5709e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> MatGetRowIJ 3 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatCreateSubMats 1 1.0 2.8989e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> MatGetOrdering 3 1.0 1.1200e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> MatView 3 1.0 1.2600e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecDot 82 1.0 8.9328e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 4 4 0 0 0 4 4 0 0 0 2048
>>> VecDotNorm2 41 1.0 9.9019e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 5 4 0 0 0 5 4 0 0 0 1848
>>> VecNorm 43 1.0 3.9988e-02 1.0 9.59e+07 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 2 2 0 0 0 2 2 0 0 0 2399
>>> VecCopy 2 1.0 1.1150e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecSet 271 1.0 4.2833e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
>>> VecAXPY 1 1.0 5.9200e-04 1.0 2.23e+06 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 3769
>>> VecAXPBYCZ 82 1.0 1.1448e-01 1.0 3.66e+08 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 5 8 0 0 0 5 8 0 0 0 3196
>>> VecWAXPY 82 1.0 6.7460e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 3 4 0 0 0 3 4 0 0 0 2712
>>> VecAssemblyBegin 2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecAssemblyEnd 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> VecScatterBegin 84 1.0 5.2800e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>> KSPSetUp 4 1.0 1.4765e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>> KSPSolve 1 1.0 1.8514e+00 1.0 4.31e+09 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 85100 0 0 0 85100 0 0 0 2329
>>> PCSetUp 4 1.0 1.0193e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 5 1 0 0 0 5 1 0 0 0 233
>>> PCSetUpOnBlocks 1 1.0 7.1421e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 3 1 0 0 0 3 1 0 0 0 332
>>> PCApply 84 1.0 5.7927e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 26 28 0 0 0 26 28 0 0 0 2060
>>> PCApplyOnBlocks 252 1.0 5.7902e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00
>>> 0.0e+00 26 28 0 0 0 26 28 0 0 0 2061
>>> ------------------------------------------------------------------------------------------------------------------------
>>>
>>>
>>> Cheers,
>>> Hao
>>>
>>>
>>>
>>> From: Smith, Barry F. <[email protected]>
>>> Sent: Thursday, February 6, 2020 7:03 PM
>>> To: Hao DONG <[email protected]>
>>> Cc: [email protected] <[email protected]>
>>> Subject: Re: [petsc-users] What is the right way to implement a (block)
>>> Diagonal ILU as PC?
>>>
>>>
>>> Read my comments ALL the way down, they go a long way.
>>>
>>>> On Feb 6, 2020, at 3:43 AM, Hao DONG <[email protected]> wrote:
>>>>
>>>> Dear Hong and Barry,
>>>>
>>>> Thanks for the suggestions. So there could be some problems in my PETSc
>>>> configuration? But my PETSc lib was indeed compiled without the debug
>>>> flags (--with-debugging=0). I use GCC/GFortran (Homebrew GCC 9.2.0) for
>>>> compiling and building both PETSc and my own Fortran code. My Fortran
>>>> compile flags are simply:
>>>>
>>>> -O3 -ffree-line-length-none -fastsse
>>>>
>>>> which is also used for --FOPTFLAGS in PETSc (I added -openmp for PETSc, but
>>>> not for my Fortran code, as I don't have any OpenMP optimizations in my program).
>>>> Note the performance test results I listed yesterday (e.g. 4.08s with 41
>>>> bcgs iterations) are without any CSR-array-to-PETSc translation overhead
>>>> (they only include the set and solve parts).
>>>
>>> PETSc doesn't use -openmp in any way for its solvers. Do not use this
>>> option, it may be slowing the code down. Please send configure.log
>>>
>>>>
>>>> I have two questions about the performance difference -
>>>>
>>>> 1. Is the ILU factorization computed only once, or is it redone at every
>>>> outer KSP iteration? The latter sounds unlikely - but if so, it could
>>>> cause some extra overhead.
>>>
>>> ILU is ONLY done when the matrix has changed, so that explanation seems wrong.
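>>>
>>> (You can confirm this from the -log_view output below: the MatLUFactorNum and
>>> MatILUFactorSym counts equal the number of blocks, 3, not the number of
>>> iterations.)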
>>>>
>>>> 2. Some KSP solvers like BCGS or TFQMR have two "half-iterations" for each
>>>> iteration step. Not sure how it works in PETSc, but is it possible that
>>>> both "half" relative residuals are counted in the output array,
>>>> doubling the number of iterations (though that cannot explain the extra
>>>> time consumed)?
>>>
>>> Yes, PETSc might report them as two, you need to check the exact code.
>>>
>>>>
>>>> Anyway, the output with -log_view from the same 278906 by 278906 matrix
>>>> with 3-block D-ILU in PETSc is as follows:
>>>>
>>>>
>>>> ---------------------------------------------- PETSc Performance Summary:
>>>> ----------------------------------------------
>>>>
>>>> MEMsolv.lu on a arch-darwin-c-opt named Haos-MBP with 1 processor, by
>>>> donghao Thu Feb 6 09:07:35 2020
>>>> Using Petsc Release Version 3.12.3, unknown
>>>>
>>>> Max Max/Min Avg Total
>>>> Time (sec): 4.443e+00 1.000 4.443e+00
>>>> Objects: 1.155e+03 1.000 1.155e+03
>>>> Flop: 4.311e+09 1.000 4.311e+09 4.311e+09
>>>> Flop/sec: 9.703e+08 1.000 9.703e+08 9.703e+08
>>>> MPI Messages: 0.000e+00 0.000 0.000e+00 0.000e+00
>>>> MPI Message Lengths: 0.000e+00 0.000 0.000e+00 0.000e+00
>>>> MPI Reductions: 0.000e+00 0.000
>>>>
>>>> Flop counting convention: 1 flop = 1 real number operation of type
>>>> (multiply/divide/add/subtract)
>>>> e.g., VecAXPY() for real vectors of length N
>>>> --> 2N flop
>>>> and VecAXPY() for complex vectors of length N
>>>> --> 8N flop
>>>>
>>>> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages
>>>> --- -- Message Lengths -- -- Reductions --
>>>> Avg %Total Avg %Total Count %Total
>>>> Avg %Total Count %Total
>>>> 0: Main Stage: 4.4435e+00 100.0% 4.3113e+09 100.0% 0.000e+00 0.0%
>>>> 0.000e+00 0.0% 0.000e+00 0.0%
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> See the 'Profiling' chapter of the users' manual for details on
>>>> interpreting output.
>>>> Phase summary info:
>>>> Count: number of times phase was executed
>>>> Time and Flop: Max - maximum over all processors
>>>> Ratio - ratio of maximum to minimum over all processors
>>>> Mess: number of messages sent
>>>> AvgLen: average message length (bytes)
>>>> Reduct: number of global reductions
>>>> Global: entire computation
>>>> Stage: stages of a computation. Set stages with PetscLogStagePush() and
>>>> PetscLogStagePop().
>>>> %T - percent time in this phase %F - percent flop in this
>>>> phase
>>>> %M - percent messages in this phase %L - percent message lengths
>>>> in this phase
>>>> %R - percent reductions in this phase
>>>> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
>>>> all processors)
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event Count Time (sec) Flop
>>>> --- Global --- --- Stage ---- Total
>>>> Max Ratio Max Ratio Max Ratio Mess AvgLen
>>>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>> BuildTwoSidedF 1 1.0 2.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> MatMult 83 1.0 1.7815e+00 1.0 2.08e+09 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 40 48 0 0 0 40 48 0 0 0 1168
>>>> MatSolve 252 1.0 1.2708e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 29 28 0 0 0 29 28 0 0 0 939
>>>> MatLUFactorNum 3 1.0 7.9725e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 2 1 0 0 0 2 1 0 0 0 298
>>>> MatILUFactorSym 3 1.0 2.6998e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>>> MatAssemblyBegin 5 1.0 3.6000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> MatAssemblyEnd 5 1.0 3.1619e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>>> MatGetRowIJ 3 1.0 2.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> MatCreateSubMats 1 1.0 3.9659e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>>> MatGetOrdering 3 1.0 4.3070e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> MatView 3 1.0 1.3600e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> VecDot 82 1.0 1.8948e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 4 4 0 0 0 4 4 0 0 0 966
>>>> VecDotNorm2 41 1.0 1.6812e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 4 4 0 0 0 4 4 0 0 0 1088
>>>> VecNorm 43 1.0 9.5099e-02 1.0 9.59e+07 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 2 2 0 0 0 2 2 0 0 0 1009
>>>> VecCopy 2 1.0 1.0120e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> VecSet 271 1.0 3.8922e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>>> VecAXPY 1 1.0 7.7200e-04 1.0 2.23e+06 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 2890
>>>> VecAXPBYCZ 82 1.0 2.4370e-01 1.0 3.66e+08 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 5 8 0 0 0 5 8 0 0 0 1502
>>>> VecWAXPY 82 1.0 1.4148e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 3 4 0 0 0 3 4 0 0 0 1293
>>>> VecAssemblyBegin 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> VecAssemblyEnd 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> VecScatterBegin 84 1.0 5.9300e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> KSPSetUp 4 1.0 1.4167e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> KSPSolve 1 1.0 4.0250e+00 1.0 4.31e+09 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 91100 0 0 0 91100 0 0 0 1071
>>>> PCSetUp 4 1.0 1.5207e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 3 1 0 0 0 3 1 0 0 0 156
>>>> PCSetUpOnBlocks 1 1.0 1.1116e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 3 1 0 0 0 3 1 0 0 0 214
>>>> PCApply 84 1.0 1.2912e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 29 28 0 0 0 29 28 0 0 0 924
>>>> PCApplyOnBlocks 252 1.0 1.2909e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 29 28 0 0 0 29 28 0 0 0 924
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> # I skipped the memory part - the options (and compiler options) are as
>>>> follows:
>>>>
>>>> #PETSc Option Table entries:
>>>> -ksp_type bcgs
>>>> -ksp_view
>>>> -log_view
>>>> -pc_bjacobi_local_blocks 3
>>>> -pc_factor_levels 0
>>>> -pc_sub_type ilu
>>>> -pc_type bjacobi
>>>> #End of PETSc Option Table entries
>>>> Compiled with FORTRAN kernels
>>>> Compiled with full precision matrices (default)
>>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>>> sizeof(PetscScalar) 16 sizeof(PetscInt) 4
>>>> Configure options: --with-scalar-type=complex --download-mumps
>>>> --download-scalapack --with-fortran-kernels=1 --FOPTFLAGS="-O3
>>>> -fastsse -mp -openmp" --COPTFLAGS="-O3 -fastsse -mp -openmp"
>>>> --CXXOPTFLAGS="-O3 -fastsse -mp -openmp" --with-debugging=0
>>>> -----------------------------------------
>>>> Libraries compiled on 2020-02-03 10:44:31 on Haos-MBP
>>>> Machine characteristics: Darwin-19.2.0-x86_64-i386-64bit
>>>> Using PETSc directory: /Users/donghao/src/git/PETSc-current
>>>> Using PETSc arch: arch-darwin-c-opt
>>>> -----------------------------------------
>>>>
>>>> Using C compiler: mpicc -Wall -Wwrite-strings -Wno-strict-aliasing
>>>> -Wno-unknown-pragmas -fstack-protector -fno-stack-check
>>>> -Qunused-arguments -fvisibility=hidden
>>>> Using Fortran compiler: mpif90 -Wall -ffree-line-length-0
>>>> -Wno-unused-dummy-argument
>>>>
>>>> Using include paths: -I/Users/donghao/src/git/PETSc-current/include
>>>> -I/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/include
>>>> -----------------------------------------
>>>>
>>>> Using C linker: mpicc
>>>> Using Fortran linker: mpif90
>>>> Using libraries:
>>>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib
>>>> -L/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib -lpetsc
>>>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib
>>>> -L/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib
>>>> -Wl,-rpath,/usr/local/opt/libevent/lib -L/usr/local/opt/libevent/lib
>>>> -Wl,-rpath,/usr/local/Cellar/open-mpi/4.0.2/lib
>>>> -L/usr/local/Cellar/open-mpi/4.0.2/lib -Wl,-rpath,/usr/local/Cellar/
>>>> gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-darwin19/9.2.0
>>>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-
>>>> darwin19/9.2.0 -Wl,-rpath,/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9
>>>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 -lcmumps -ldmumps -lsmumps
>>>> -lzmumps -lmumps_common -lpord -lscalapack -llapack -lblas -lc++ -ldl
>>>> -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi
>>>> -lgfortran -lquadmath -lm -lc++ -ldl
>>>>
>>>>
>>>> On the other hand, running PETSc with
>>>>
>>>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -pc_sub_type lu -ksp_type
>>>> gmres -ksp_monitor -ksp_view -log_view
>>>>
>>>> the same problem takes 5.37s and 72 GMRES iterations. Our previous
>>>> tests show that BiCGstab (bcgs in PETSc) is almost always the most
>>>> effective KSP solver for our non-symmetric complex system. Strangely,
>>>> the solver is still using ilu instead of lu for the sub-blocks. The output
>>>> looks like:
>>>
>>> The option is -sub_pc_type lu (note the sub_ prefix); -pc_sub_type is not a
>>> recognized option, so it was ignored and the default ilu was used.
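>>>
>>> The full set of options for this test would then look something like:
>>>
>>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type lu -ksp_type bcgs
>>> -ksp_monitor -ksp_view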
>>>
>>>>
>>>> 0 KSP Residual norm 2.480412407430e+02
>>>> 1 KSP Residual norm 8.848059967835e+01
>>>> 2 KSP Residual norm 3.415272863261e+01
>>>> 3 KSP Residual norm 1.563045190939e+01
>>>> 4 KSP Residual norm 6.241296940043e+00
>>>> 5 KSP Residual norm 2.739710899854e+00
>>>> 6 KSP Residual norm 1.391304148888e+00
>>>> 7 KSP Residual norm 7.959262020849e-01
>>>> 8 KSP Residual norm 4.828323055231e-01
>>>> 9 KSP Residual norm 2.918529739200e-01
>>>> 10 KSP Residual norm 1.905508589557e-01
>>>> 11 KSP Residual norm 1.291541892702e-01
>>>> 12 KSP Residual norm 8.827145774707e-02
>>>> 13 KSP Residual norm 6.521331095889e-02
>>>> 14 KSP Residual norm 5.095787952595e-02
>>>> 15 KSP Residual norm 4.043060387395e-02
>>>> 16 KSP Residual norm 3.232590200012e-02
>>>> 17 KSP Residual norm 2.593944982216e-02
>>>> 18 KSP Residual norm 2.064639483533e-02
>>>> 19 KSP Residual norm 1.653916663492e-02
>>>> 20 KSP Residual norm 1.334946415452e-02
>>>> 21 KSP Residual norm 1.092886880597e-02
>>>> 22 KSP Residual norm 8.988004105542e-03
>>>> 23 KSP Residual norm 7.466501315240e-03
>>>> 24 KSP Residual norm 6.284389135436e-03
>>>> 25 KSP Residual norm 5.425231669964e-03
>>>> 26 KSP Residual norm 4.766338253084e-03
>>>> 27 KSP Residual norm 4.241238878242e-03
>>>> 28 KSP Residual norm 3.808113525685e-03
>>>> 29 KSP Residual norm 3.449383788116e-03
>>>> 30 KSP Residual norm 3.126025526388e-03
>>>> 31 KSP Residual norm 2.958328054299e-03
>>>> 32 KSP Residual norm 2.802344900403e-03
>>>> 33 KSP Residual norm 2.621993580492e-03
>>>> 34 KSP Residual norm 2.430066269304e-03
>>>> 35 KSP Residual norm 2.259043079597e-03
>>>> 36 KSP Residual norm 2.104287972986e-03
>>>> 37 KSP Residual norm 1.952916080045e-03
>>>> 38 KSP Residual norm 1.804988937999e-03
>>>> 39 KSP Residual norm 1.643302117377e-03
>>>> 40 KSP Residual norm 1.471661332036e-03
>>>> 41 KSP Residual norm 1.286445911163e-03
>>>> 42 KSP Residual norm 1.127543025848e-03
>>>> 43 KSP Residual norm 9.777148275484e-04
>>>> 44 KSP Residual norm 8.293314450006e-04
>>>> 45 KSP Residual norm 6.989331136622e-04
>>>> 46 KSP Residual norm 5.852307780220e-04
>>>> 47 KSP Residual norm 4.926715539762e-04
>>>> 48 KSP Residual norm 4.215941372075e-04
>>>> 49 KSP Residual norm 3.699489548162e-04
>>>> 50 KSP Residual norm 3.293897163533e-04
>>>> 51 KSP Residual norm 2.959954542998e-04
>>>> 52 KSP Residual norm 2.700193032414e-04
>>>> 53 KSP Residual norm 2.461789791204e-04
>>>> 54 KSP Residual norm 2.218839085563e-04
>>>> 55 KSP Residual norm 1.945154309976e-04
>>>> 56 KSP Residual norm 1.661128781744e-04
>>>> 57 KSP Residual norm 1.413198766258e-04
>>>> 58 KSP Residual norm 1.213984003195e-04
>>>> 59 KSP Residual norm 1.044317450754e-04
>>>> 60 KSP Residual norm 8.919957502977e-05
>>>> 61 KSP Residual norm 8.042584301275e-05
>>>> 62 KSP Residual norm 7.292784493581e-05
>>>> 63 KSP Residual norm 6.481935501872e-05
>>>> 64 KSP Residual norm 5.718564652679e-05
>>>> 65 KSP Residual norm 5.072589750116e-05
>>>> 66 KSP Residual norm 4.487930741285e-05
>>>> 67 KSP Residual norm 3.941040674119e-05
>>>> 68 KSP Residual norm 3.492873281291e-05
>>>> 69 KSP Residual norm 3.103798339845e-05
>>>> 70 KSP Residual norm 2.822943237409e-05
>>>> 71 KSP Residual norm 2.610615023776e-05
>>>> 72 KSP Residual norm 2.441692671173e-05
>>>> KSP Object: 1 MPI processes
>>>> type: gmres
>>>> restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization
>>>> with no iterative refinement
>>>> happy breakdown tolerance 1e-30
>>>> maximum iterations=150, nonzero initial guess
>>>> tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
>>>> left preconditioning
>>>> using PRECONDITIONED norm type for convergence test
>>>> PC Object: 1 MPI processes
>>>> type: bjacobi
>>>> number of blocks = 3
>>>> Local solve is same for all blocks, in the following KSP and PC objects:
>>>> KSP Object: (sub_) 1 MPI processes
>>>> type: preonly
>>>> maximum iterations=10000, initial guess is zero
>>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>>>> left preconditioning
>>>> using NONE norm type for convergence test
>>>> PC Object: (sub_) 1 MPI processes
>>>> type: ilu
>>>> out-of-place factorization
>>>> 0 levels of fill
>>>> tolerance for zero pivot 2.22045e-14
>>>> matrix ordering: natural
>>>> factor fill ratio given 1., needed 1.
>>>> Factored matrix follows:
>>>> Mat Object: 1 MPI processes
>>>> type: seqaij
>>>> rows=92969, cols=92969
>>>> package used to perform factorization: petsc
>>>> total: nonzeros=638417, allocated nonzeros=638417
>>>> total number of mallocs used during MatSetValues calls=0
>>>> not using I-node routines
>>>> linear system matrix = precond matrix:
>>>> Mat Object: 1 MPI processes
>>>> type: seqaij
>>>> rows=92969, cols=92969
>>>> total: nonzeros=638417, allocated nonzeros=638417
>>>> total number of mallocs used during MatSetValues calls=0
>>>> not using I-node routines
>>>> linear system matrix = precond matrix:
>>>> Mat Object: 1 MPI processes
>>>> type: mpiaij
>>>> rows=278906, cols=278906
>>>> total: nonzeros=3274027, allocated nonzeros=3274027
>>>> total number of mallocs used during MatSetValues calls=0
>>>> not using I-node (on process 0) routines
>>>> ...
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event Count Time (sec) Flop
>>>> --- Global --- --- Stage ---- Total
>>>> Max Ratio Max Ratio Max Ratio Mess AvgLen
>>>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>> BuildTwoSidedF 1 1.0 2.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> MatMult 75 1.0 1.5812e+00 1.0 1.88e+09 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 28 24 0 0 0 28 24 0 0 0 1189
>>>> MatSolve 228 1.0 1.1442e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 20 14 0 0 0 20 14 0 0 0 944
>>>
>>> These flop rates are ok, but not great. Perhaps an older machine.
>>>
>>>> MatLUFactorNum 3 1.0 8.1930e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 290
>>>> MatILUFactorSym 3 1.0 2.7102e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> MatAssemblyBegin 5 1.0 3.7000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> MatAssemblyEnd 5 1.0 3.1895e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>>> MatGetRowIJ 3 1.0 2.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> MatCreateSubMats 1 1.0 4.0904e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
>>>> MatGetOrdering 3 1.0 4.2640e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> MatView 3 1.0 1.4400e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> VecMDot 72 1.0 1.1984e+00 1.0 2.25e+09 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 21 28 0 0 0 21 28 0 0 0 1877
>>>
>>> 21 percent of the time in VecMDot is huge for a sequential run. I
>>> think maybe you are using a terrible OpenMP BLAS?
>>>
>>> Send configure.log
>>>
>>>
>>>> VecNorm 76 1.0 1.6841e-01 1.0 1.70e+08 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 3 2 0 0 0 3 2 0 0 0 1007
>>>> VecScale 75 1.0 1.8241e-02 1.0 8.37e+07 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 1 0 0 0 0 1 0 0 0 4587
>>>> VecCopy 3 1.0 1.4970e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> VecSet 276 1.0 9.1970e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
>>>> VecAXPY 6 1.0 3.7450e-03 1.0 1.34e+07 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 3575
>>>> VecMAXPY 75 1.0 1.0022e+00 1.0 2.41e+09 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 18 30 0 0 0 18 30 0 0 0 2405
>>>> VecAssemblyBegin 2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> VecAssemblyEnd 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> VecScatterBegin 76 1.0 5.5100e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> VecNormalize 75 1.0 1.8462e-01 1.0 2.51e+08 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 3 3 0 0 0 3 3 0 0 0 1360
>>>> KSPSetUp 4 1.0 1.1341e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
>>>> KSPSolve 1 1.0 5.3123e+00 1.0 7.91e+09 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 93100 0 0 0 93100 0 0 0 1489
>>>> KSPGMRESOrthog 72 1.0 2.1316e+00 1.0 4.50e+09 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 37 57 0 0 0 37 57 0 0 0 2110
>>>> PCSetUp 4 1.0 1.5531e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 3 0 0 0 0 3 0 0 0 0 153
>>>> PCSetUpOnBlocks 1 1.0 1.1343e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 2 0 0 0 0 2 0 0 0 0 209
>>>> PCApply 76 1.0 1.1671e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 20 14 0 0 0 20 14 0 0 0 925
>>>> PCApplyOnBlocks 228 1.0 1.1668e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00
>>>> 0.0e+00 20 14 0 0 0 20 14 0 0 0 925
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> ...
>>>> #PETSc Option Table entries:
>>>> -ksp_monitor
>>>> -ksp_type gmres
>>>> -ksp_view
>>>> -log_view
>>>> -pc_bjacobi_local_blocks 3
>>>> -pc_sub_type lu
>>>> -pc_type bjacobi
>>>> #End of PETSc Option Table entries
>>>> ...
>>>>
>>>> Does any of the setup/output ring a bell?
>>>>
>>>> BTW, out of curiosity - what is an "I-node" routine?
>>>>
>>>>
>>>> Cheers,
>>>> Hao
>>>>
>>>>
>>>> From: Smith, Barry F. <[email protected]>
>>>> Sent: Wednesday, February 5, 2020 9:42 PM
>>>> To: Hao DONG <[email protected]>
>>>> Cc: [email protected] <[email protected]>
>>>> Subject: Re: [petsc-users] What is the right way to implement a (block)
>>>> Diagonal ILU as PC?
>>>>
>>>>
>>>>
>>>>> On Feb 5, 2020, at 4:36 AM, Hao DONG <[email protected]> wrote:
>>>>>
>>>>> Thanks a lot for your suggestions, Hong and Barry -
>>>>>
>>>>> As you suggested, I first tried the LU direct solvers (built-in and
>>>>> MUMPS) this morning, which work perfectly, albeit slowly. As my problem
>>>>> is part of a PDE-based optimization, an exact solution during the
>>>>> search procedure is not necessary (I often set a relative tolerance of
>>>>> 1E-7/1E-8, etc. for Krylov subspace methods). I also tried BJACOBI with
>>>>> exact LU; the KSP converges in one or two iterations, as expected.
>>>>>
>>>>> I added the -ksp_view option for the D-ILU test (still with block Jacobi as
>>>>> the preconditioner and bcgs as the KSP solver). The -ksp_view output from
>>>>> one of the examples in a toy model looks like:
>>>>>
>>>>> KSP Object: 1 MPI processes
>>>>> type: bcgs
>>>>> maximum iterations=120, nonzero initial guess
>>>>> tolerances: relative=1e-07, absolute=1e-50, divergence=10000.
>>>>> left preconditioning
>>>>> using PRECONDITIONED norm type for convergence test
>>>>> PC Object: 1 MPI processes
>>>>> type: bjacobi
>>>>> number of blocks = 3
>>>>> Local solve is same for all blocks, in the following KSP and PC
>>>>> objects:
>>>>> KSP Object: (sub_) 1 MPI processes
>>>>> type: preonly
>>>>> maximum iterations=10000, initial guess is zero
>>>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000.
>>>>> left preconditioning
>>>>> using NONE norm type for convergence test
>>>>> PC Object: (sub_) 1 MPI processes
>>>>> type: ilu
>>>>> out-of-place factorization
>>>>> 0 levels of fill
>>>>> tolerance for zero pivot 2.22045e-14
>>>>> matrix ordering: natural
>>>>> factor fill ratio given 1., needed 1.
>>>>> Factored matrix follows:
>>>>> Mat Object: 1 MPI processes
>>>>> type: seqaij
>>>>> rows=11294, cols=11294
>>>>> package used to perform factorization: petsc
>>>>> total: nonzeros=76008, allocated nonzeros=76008
>>>>> total number of mallocs used during MatSetValues calls=0
>>>>> not using I-node routines
>>>>> linear system matrix = precond matrix:
>>>>> Mat Object: 1 MPI processes
>>>>> type: seqaij
>>>>> rows=11294, cols=11294
>>>>> total: nonzeros=76008, allocated nonzeros=76008
>>>>> total number of mallocs used during MatSetValues calls=0
>>>>> not using I-node routines
>>>>> linear system matrix = precond matrix:
>>>>> Mat Object: 1 MPI processes
>>>>> type: mpiaij
>>>>> rows=33880, cols=33880
>>>>> total: nonzeros=436968, allocated nonzeros=436968
>>>>> total number of mallocs used during MatSetValues calls=0
>>>>> not using I-node (on process 0) routines
>>>>>
>>>>> do you see something wrong with my setup?
>>>>>
>>>>> I also tried a quick performance test with a small 278906 by 278906
>>>>> matrix (3850990 nnz) with the following parameters:
>>>>>
>>>>> -ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 -pc_sub_type
>>>>> ilu -ksp_view
>>>>>
>>>>> Reducing the relative residual to 1E-7
>>>>>
>>>>> took 4.08s with 41 bcgs iterations.
>>>>>
>>>>> Merely changing -pc_bjacobi_local_blocks to 6
>>>>>
>>>>> took 7.02s with 73 bcgs iterations; 9 blocks would further take 9.45s
>>>>> with 101 bcgs iterations.
>>>>
>>>> This is normal: more blocks means slower convergence.
>>>>>
>>>>> As a reference, my home-brew Fortran code solves the same problem
>>>>> (3-block D-ILU0) in
>>>>>
>>>>> 1.84s with 24 bcgs iterations (the bcgs code is also a home-brew one)…
>>>>>
>>>> Run the PETSc code with optimization (./configure --with-debugging=0) and
>>>> run the code with -log_view; this will show where the PETSc code is
>>>> spending the time (send it to us).
>>>>
>>>>
>>>>>
>>>>>
>>>>> Well, by saying "use an explicit L/U matrix as a preconditioner", I wonder
>>>>> if a user is allowed to provide their own (separate) L and U Mats for
>>>>> preconditioning (like in the Matlab solvers), e.g.
>>>>>
>>>>> x = qmr(A,b,Tol,MaxIter,L,U,x)
>>>>>
>>>>> As I already explicitly construct the L and U matrices in Fortran, it
>>>>> is not hard to convert them to Mats in PETSc to test PETSc and my Fortran
>>>>> code head-to-head. In that case, the A, b, x, and L/U are all identical,
>>>>> so it would be easier to see where the problem comes from.
>>>>>
>>>>>
>>>> No, we don't provide this kind of support
>>>>
>>>>
>>>>>
>>>>> BTW, there is another thing I wondered about - is there a way to output
>>>>> the residual in the unpreconditioned norm? I tried
>>>>>
>>>>> call KSPSetNormType(ksp_local, KSP_NORM_UNPRECONDITIONED, ierr)
>>>>>
>>>>> but I always get an error that the current KSP does not support an
>>>>> unpreconditioned norm with LEFT/RIGHT preconditioning (either way). Is it
>>>>> possible to output the unpreconditioned residual in PETSc at all?
>>>>
>>>> -ksp_monitor_true_residual. You can also run GMRES (and some other
>>>> methods) with right preconditioning, -ksp_pc_side right; then the residual
>>>> computed by the algorithm is the unpreconditioned residual.
>>>>
>>>> KSPSetNormType sets the norm used in the algorithm; it generally always
>>>> has to be left or right, and only a couple of algorithms support the
>>>> unpreconditioned norm directly.
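>>>>
>>>> For example, to see the true (unpreconditioned) residual while using right
>>>> preconditioning you can run with something like
>>>>
>>>> -ksp_type gmres -ksp_pc_side right -ksp_monitor_true_residual
>>>>
>>>> (or call KSPSetPCSide(ksp, PC_RIGHT, ierr) from Fortran).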
>>>>
>>>> Barry
>>>>
>>>>
>>>>>
>>>>> Cheers,
>>>>> Hao
>>>>>
>>>>>
>>>>> From: Smith, Barry F. <[email protected]>
>>>>> Sent: Tuesday, February 4, 2020 8:27 PM
>>>>> To: Hao DONG <[email protected]>
>>>>> Cc: [email protected] <[email protected]>
>>>>> Subject: Re: [petsc-users] What is the right way to implement a (block)
>>>>> Diagonal ILU as PC?
>>>>>
>>>>>
>>>>>
>>>>>> On Feb 4, 2020, at 12:41 PM, Hao DONG <[email protected]> wrote:
>>>>>>
>>>>>> Dear all,
>>>>>>
>>>>>>
>>>>>> I have a few questions about the implementation of a diagonal ILU PC in
>>>>>> PETSc. I want to solve a very simple system with KSP (in parallel); the
>>>>>> nature of the system (finite-difference time-harmonic Maxwell) is
>>>>>> probably not important to the question itself. Long story short, I just
>>>>>> need to solve a set of equations Ax = b with a block-diagonal system
>>>>>> matrix, like (not sure if the mono font works):
>>>>>>
>>>>>> |X |
>>>>>> A =| Y |
>>>>>> | Z|
>>>>>>
>>>>>> Note that A is not really block-diagonal; it's just a multi-diagonal
>>>>>> matrix determined by the FD mesh, where most elements are close to the
>>>>>> diagonal. So instead of a full ILU decomposition, a D-ILU is a good
>>>>>> approximation as a preconditioner, and the number of blocks should not
>>>>>> matter too much:
>>>>>>
>>>>>> |Lx | |Ux |
>>>>>> L = | Ly | and U = | Uy |
>>>>>> | Lz| | Uz|
>>>>>>
>>>>>> Where [Lx, Ux] = ILU0(X), etc. Indeed, the D-ILU preconditioner (with 3N
>>>>>> blocks) is quite efficient with Krylov subspace methods like BiCGstab or
>>>>>> QMR in my sequential Fortran/Matlab code.
>>>>>>
>>>>>> So like most users, I am looking for a parallel implementation of this
>>>>>> in PETSc. After looking through the manual, it seems that the
>>>>>> most straightforward way to do it is through PCBJACOBI. Not sure I
>>>>>> understand it right: I just set up a 3-block PCBJACOBI and give each
>>>>>> block a sub-KSP with PCILU. Is this supposed to be equivalent to my
>>>>>> D-ILU preconditioner? My little block of Fortran code looks like:
>>>>>> ...
>>>>>> call PCBJacobiSetTotalBlocks(pc_local,Ntotal, &
>>>>>> & isubs,ierr)
>>>>>> call PCBJacobiSetLocalBlocks(pc_local, Nsub, &
>>>>>> & isubs(istart:iend),ierr)
>>>>>> ! set up the block jacobi structure
>>>>>> call KSPSetup(ksp_local,ierr)
>>>>>> ! allocate sub ksps
>>>>>> allocate(ksp_sub(Nsub))
>>>>>> call PCBJacobiGetSubKSP(pc_local,Nsub,istart, &
>>>>>> & ksp_sub,ierr)
>>>>>> do i=1,Nsub
>>>>>> call KSPGetPC(ksp_sub(i),pc_sub,ierr)
>>>>>> !ILU preconditioner
>>>>>> call PCSetType(pc_sub,ptype,ierr)
>>>>>> call PCFactorSetLevels(pc_sub,1,ierr) ! use ILU(1) here
>>>>>> call KSPSetType(ksp_sub(i),KSPPREONLY,ierr)
>>>>>> end do
>>>>>> call KSPSetTolerances(ksp_local,KSSiter%tol,PETSC_DEFAULT_REAL, &
>>>>>> & PETSC_DEFAULT_REAL,KSSiter%maxit,ierr)
>>>>>> …
>>>>>
>>>>> This code looks essentially right. You should run with -ksp_view to
>>>>> see exactly what solver PETSc is using.
>>>>>
>>>>>>
>>>>>> I understand that the parallel performance may not be comparable, so I
>>>>>> first set up a one-process test (with MPIAIJ, but all the rows are local
>>>>>> since there is only one process). The system is solved without any
>>>>>> problem (identical results within error). But the performance is
>>>>>> actually a lot worse (code built without debugging flags for the
>>>>>> performance tests) than my own home-brew implementation in Fortran (I
>>>>>> wrote my own ILU0 in CSR sparse matrix format), which is hard to believe.
>>>>>> I suspect the difference comes from the PC, as the PETSc version took
>>>>>> many more BiCGstab iterations (60-ish vs 100-ish) to converge to the same
>>>>>> relative tolerance.
>>>>>
>>>>> PETSc uses GMRES by default with a restart of 30 and left
>>>>> preconditioning. Different exact choices in the solver (which is why
>>>>> -ksp_view is so useful) could explain the differences between the runs
>>>>> of your code and PETSc's.
>>>>>>
>>>>>> This is further confirmed when I change the setup of the D-ILU (using 6 or
>>>>>> 9 blocks instead of 3). While my Fortran/Matlab codes see a minimal
>>>>>> performance difference (<5%) when I play with the D-ILU setup,
>>>>>> increasing the number of D-ILU blocks from 3 to 9 causes the KSP setup
>>>>>> with PCBJACOBI to suffer a performance decrease of more than 50% in the
>>>>>> sequential test.
>>>>>
>>>>> This is odd; the more blocks, the smaller each block, so the quicker the
>>>>> ILU setup should be. You can run the various cases with -log_view and send
>>>>> us the output to see where the time is going in each part of the
>>>>> computation.
>>>>>
>>>>>> So my implementation IS somewhat different in PETSc. Am I missing
>>>>>> something in the PCBJACOBI setup? Or do I have some fundamental
>>>>>> misunderstanding of how PCBJACOBI works in PETSc?
>>>>>
>>>>> Probably not.
>>>>>>
>>>>>> If this is not the right way to implement a block-diagonal ILU as a
>>>>>> (parallel) PC, please kindly point me in the right direction. I searched
>>>>>> through the mailing list for answers, only to find a couple of
>>>>>> similar questions... An example would be nice.
>>>>>
>>>>> Your approach seems fundamentally right, but I cannot be sure there are no
>>>>> bugs.
>>>>>>
>>>>>> On the other hand, does PETSc support a simple way to use an explicit L/U
>>>>>> matrix as a preconditioner? I can import the D-ILU matrix (I have already
>>>>>> converted my A matrix into a Mat) constructed in my Fortran code to make a
>>>>>> better comparison. Or do I have to construct my own PC using PCSHELL? If
>>>>>> so, is there a good tutorial/example on how to use PCSHELL
>>>>>> (in a more advanced sense, like how to set up the PC side and transpose)?
>>>>>
>>>>> Not sure what you mean by an explicit L/U matrix as a preconditioner. As
>>>>> Hong said, yes, you can use a parallel LU from MUMPS, SuperLU_DIST, or
>>>>> PaStiX as the solver. You do not need any shell code. You simply need to
>>>>> set the PCType to lu.
>>>>>
>>>>> You can also set all of these options from the command line and don't need
>>>>> to write the code specifically. So call KSPSetFromOptions() and then, for
>>>>> example,
>>>>>
>>>>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu (this
>>>>> last one is applied to each block, so you could use -sub_pc_type lu and it
>>>>> would use LU on each block.)
>>>>>
>>>>> -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps (do a
>>>>> parallel LU with MUMPS)
>>>>>
>>>>> By not hardwiring things in the code and just using options, you can test
>>>>> out different cases much more quickly.
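>>>>>
>>>>> A minimal sketch of that pattern in Fortran (the variable names here are
>>>>> just placeholders for your own):
>>>>>
>>>>> ! assemble A, b, x as usual, then:
>>>>> call KSPCreate(PETSC_COMM_WORLD,ksp,ierr)
>>>>> call KSPSetOperators(ksp,A,A,ierr)
>>>>> ! picks up -ksp_type, -pc_type, -sub_pc_type, ... from the command line
>>>>> call KSPSetFromOptions(ksp,ierr)
>>>>> call KSPSolve(ksp,b,x,ierr)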
>>>>>
>>>>> Use -ksp_view to make sure it is using the solver the way you expect.
>>>>>
>>>>> Barry
>>>>>
>>>>>
>>>>>
>>>>> Barry
>>>>>
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> Hao
>>>
>>> <configure.log>
>>
>