Hi Barry, Thank you for your suggestions (and insights)! Indeed, my initial motivation to try out PETSc is the variety of methods it offers. As my matrix pattern is relatively simple (a 3D time-harmonic Maxwell equation arising from a staggered-grid finite difference), and considering that I am not wealthy enough to afford the direct solvers, I was looking for a fast Krylov subspace method / preconditioner that scales well with, say, tens of CPU cores.
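Just to make that plan concrete for myself (and since you pointed out that switching solvers from the command line is much quicker than hardwiring them in the code), the first experiments I intend to try are roughly the options below - algebraic multigrid (gamg) as the preconditioner, and a one-level additive Schwarz (asm) domain decomposition with ILU on each subdomain. I have taken the gamg/asm option names from the manual rather than from anything we have discussed, so I am not certain I have them exactly right, and I am not yet sure either will behave well on my indefinite complex Maxwell system - I will confirm with -ksp_view that the options are picked up the way I expect:

   # multigrid as preconditioner: PETSc's built-in algebraic multigrid
   -ksp_type bcgs -pc_type gamg -ksp_monitor_true_residual -ksp_view -log_view

   # one-level domain decomposition: additive Schwarz with ILU(0) on each subdomain
   -ksp_type bcgs -pc_type asm -pc_asm_overlap 1 -sub_pc_type ilu -ksp_monitor_true_residual -ksp_view -log_view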
As a simple block-Jacobi preconditioner seems to lose its efficiency with more than a handful of blocks, I plan to look into other methods/preconditioners, e.g. multigrid (as a preconditioner) or domain decomposition methods. But I will probably need to read through a fair amount of the literature before laying my hands on those (or bother you with more questions!). Anyway, thanks again for your kind help. All the best, Hao > On Feb 8, 2020, at 8:02 AM, Smith, Barry F. <[email protected]> wrote: > > > >> On Feb 7, 2020, at 7:44 AM, Hao DONG <[email protected]> wrote: >> >> Thanks, Barry, I really appreciate your help - >> >> I removed the OpenMP flags and rebuilt PETSc. So the problem is from the >> BLAS lib I linked? > > Yes, the openmp causes it to run in parallel, but the problem is not big > enough and the machine is not good enough for parallel BLAS to speed things > up, instead it slows things down a lot. We see this often, parallel BLAS must > be used with care > >> Not sure which version my BLAS is, though… But I also included the >> -download-Scalapack option. Shouldn’t that enable linking with PBLAS >> automatically? >> >> After looking at the bcgs code in PETSc, I suppose the iteration residual >> is indeed recorded twice per one "actual iteration". So that can >> explain the difference of iteration numbers. >> >> My laptop is indeed an old machine (MBP15 mid-2014). I just cannot work with >> vi without a physical ESC key... > > The latest has a physical ESC, I am stuck without the ESC for a couple more > years. > >> I have attached the configure.log - didn’t know that it is so large! >> >> Anyway, it seems that the removal of -openmp changes quite a lot of things, >> the performance is indeed getting much better - the flop/sec increases by a >> factor of 3. Still, I am getting 20 percent of VecMDot, but no VecMDot in >> BCGS at all (see output below), is that a feature of the gmres method? > > Yes, GMRES orthogonalizes against the last restart directions which uses > these routines while BCGS does not, this is why BCGS is cheaper per > iteration. > > PETSc is no faster than your code because the algorithm is the same, the > compilers the same, and the hardware the same. No way to have clever tricks > for PETSc to be much faster. What PETSc provides is a huge variety of tested > algorithms that no single person could code on their own. Anything in PETSc > you could code yourself if you had endless time and get basically the same > performance. 
> > Barry > > >> >> here is the output of the same problem with: >> >> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu -ksp_type gmres >> -ksp_monitor -ksp_view >> >> >> ---------------------------------------------- PETSc Performance Summary: >> ---------------------------------------------- >> >> Mod3DMT.test on a arch-darwin-c-opt named Haos-MBP with 1 processor, by >> donghao Fri Feb 7 10:26:19 2020 >> Using Petsc Release Version 3.12.3, unknown >> >> Max Max/Min Avg Total >> Time (sec): 2.520e+00 1.000 2.520e+00 >> Objects: 1.756e+03 1.000 1.756e+03 >> Flop: 7.910e+09 1.000 7.910e+09 7.910e+09 >> Flop/sec: 3.138e+09 1.000 3.138e+09 3.138e+09 >> MPI Messages: 0.000e+00 0.000 0.000e+00 0.000e+00 >> MPI Message Lengths: 0.000e+00 0.000 0.000e+00 0.000e+00 >> MPI Reductions: 0.000e+00 0.000 >> >> Flop counting convention: 1 flop = 1 real number operation of type >> (multiply/divide/add/subtract) >> e.g., VecAXPY() for real vectors of length N --> >> 2N flop >> and VecAXPY() for complex vectors of length N >> --> 8N flop >> >> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- >> -- Message Lengths -- -- Reductions -- >> Avg %Total Avg %Total Count %Total >> Avg %Total Count %Total >> 0: Main Stage: 2.5204e+00 100.0% 7.9096e+09 100.0% 0.000e+00 0.0% >> 0.000e+00 0.0% 0.000e+00 0.0% >> >> ------------------------------------------------------------------------------------------------------------------------ >> … >> ------------------------------------------------------------------------------------------------------------------------ >> Event Count Time (sec) Flop >> --- Global --- --- Stage ---- Total >> Max Ratio Max Ratio Max Ratio Mess AvgLen >> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s >> ------------------------------------------------------------------------------------------------------------------------ >> >> --- Event Stage 0: Main Stage >> >> BuildTwoSidedF 1 1.0 3.4000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatMult 75 1.0 6.2884e-01 1.0 1.88e+09 1.0 0.0e+00 0.0e+00 >> 0.0e+00 25 24 0 0 0 25 24 0 0 0 2991 >> MatSolve 228 1.0 4.4164e-01 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 >> 0.0e+00 18 14 0 0 0 18 14 0 0 0 2445 >> MatLUFactorNum 3 1.0 4.1317e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 >> 0.0e+00 2 0 0 0 0 2 0 0 0 0 574 >> MatILUFactorSym 3 1.0 2.3858e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> MatAssemblyBegin 5 1.0 4.4000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatAssemblyEnd 5 1.0 1.5067e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> MatGetRowIJ 3 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatCreateSubMats 1 1.0 2.4558e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> MatGetOrdering 3 1.0 1.3290e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatView 3 1.0 1.2800e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecMDot 72 1.0 4.9875e-01 1.0 2.25e+09 1.0 0.0e+00 0.0e+00 >> 0.0e+00 20 28 0 0 0 20 28 0 0 0 4509 >> VecNorm 76 1.0 6.6666e-02 1.0 1.70e+08 1.0 0.0e+00 0.0e+00 >> 0.0e+00 3 2 0 0 0 3 2 0 0 0 2544 >> VecScale 75 1.0 1.7982e-02 1.0 8.37e+07 1.0 0.0e+00 0.0e+00 >> 0.0e+00 1 1 0 0 0 1 1 0 0 0 4653 >> VecCopy 3 1.0 1.5080e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecSet 276 1.0 9.6784e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 4 0 0 0 0 4 0 0 0 0 0 >> VecAXPY 6 1.0 3.6860e-03 1.0 
1.34e+07 1.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 3632 >> VecMAXPY 75 1.0 4.0490e-01 1.0 2.41e+09 1.0 0.0e+00 0.0e+00 >> 0.0e+00 16 30 0 0 0 16 30 0 0 0 5951 >> VecAssemblyBegin 2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecAssemblyEnd 2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecScatterBegin 76 1.0 5.3800e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecNormalize 75 1.0 8.3690e-02 1.0 2.51e+08 1.0 0.0e+00 0.0e+00 >> 0.0e+00 3 3 0 0 0 3 3 0 0 0 2999 >> KSPSetUp 4 1.0 1.1663e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> KSPSolve 1 1.0 2.2119e+00 1.0 7.91e+09 1.0 0.0e+00 0.0e+00 >> 0.0e+00 88100 0 0 0 88100 0 0 0 3576 >> KSPGMRESOrthog 72 1.0 8.7843e-01 1.0 4.50e+09 1.0 0.0e+00 0.0e+00 >> 0.0e+00 35 57 0 0 0 35 57 0 0 0 5121 >> PCSetUp 4 1.0 9.2448e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 >> 0.0e+00 4 0 0 0 0 4 0 0 0 0 257 >> PCSetUpOnBlocks 1 1.0 6.6597e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 >> 0.0e+00 3 0 0 0 0 3 0 0 0 0 356 >> PCApply 76 1.0 4.6281e-01 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 >> 0.0e+00 18 14 0 0 0 18 14 0 0 0 2333 >> PCApplyOnBlocks 228 1.0 4.6262e-01 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 >> 0.0e+00 18 14 0 0 0 18 14 0 0 0 2334 >> ------------------------------------------------------------------------------------------------------------------------ >> >> Average time to get PetscTime(): 1e-07 >> #PETSc Option Table entries: >> -I LBFGS >> -ksp_type gmres >> -ksp_view >> -log_view >> -pc_bjacobi_local_blocks 3 >> -pc_type bjacobi >> -sub_pc_type ilu >> #End of PETSc Option Table entries >> Compiled with FORTRAN kernels >> Compiled with full precision matrices (default) >> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 >> sizeof(PetscScalar) 16 sizeof(PetscInt) 4 >> Configure options: --with-scalar-type=complex --download-mumps >> --download-scalapack --with-fortran-kernels=1 -- FOPTFLAGS="-O3 >> -ffree-line-length-0 -msse2" --COPTFLAGS="-O3 -msse2" --CXXOPTFLAGS="-O3 >> -msse2" --with-debugging=0 >> ----------------------------------------- >> Libraries compiled on 2020-02-07 10:07:42 on Haos-MBP >> Machine characteristics: Darwin-19.3.0-x86_64-i386-64bit >> Using PETSc directory: /Users/donghao/src/git/PETSc-current >> Using PETSc arch: arch-darwin-c-opt >> ----------------------------------------- >> >> Using C compiler: mpicc -Wall -Wwrite-strings -Wno-strict-aliasing >> -Wno-unknown-pragmas -fstack-protector -fno-stack- check >> -Qunused-arguments -fvisibility=hidden -O3 -msse2 >> Using Fortran compiler: mpif90 -Wall -ffree-line-length-0 >> -Wno-unused-dummy-argument -O3 -ffree-line-length-0 - msse2 >> ----------------------------------------- >> >> Using include paths: -I/Users/donghao/src/git/PETSc-current/include >> -I/Users/donghao/src/git/PETSc-current/arch-darwin- c-opt/include >> ----------------------------------------- >> >> Using C linker: mpicc >> Using Fortran linker: mpif90 >> Using libraries: >> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib >> -L/Users/donghao/src/git/PETSc- current/arch-darwin-c-opt/lib -lpetsc >> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib >> -L/Users/ donghao/src/git/PETSc-current/arch-darwin-c-opt/lib >> -Wl,-rpath,/usr/local/opt/libevent/lib -L/usr/local/opt/libevent/ lib >> -Wl,-rpath,/usr/local/Cellar/open-mpi/4.0.2/lib >> -L/usr/local/Cellar/open-mpi/4.0.2/lib -Wl,-rpath,/usr/local/Cellar/ >> 
gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-darwin19/9.2.0 >> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple- >> darwin19/9.2.0 -Wl,-rpath,/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 >> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 -lcmumps - ldmumps -lsmumps >> -lzmumps -lmumps_common -lpord -lscalapack -llapack -lblas -lc++ -ldl >> -lmpi_usempif08 - lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi >> -lgfortran -lquadmath -lm -lc++ -ldl >> ----------------------------------------- >> >> >> >> The BCGS solver performance is now comparable to my own Fortran code >> (1.84s). Still, I feel that there is something wrong hidden somewhere in my >> setup - a professional lib should to perform better, I believe. Any other >> ideas that I can look into? Interestingly there is no VecMDot operation at >> all! Here is the output with the option of: >> >> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu -ksp_type bcgs >> -ksp_monitor -ksp_view >> >> >> >> >> ---------------------------------------------- PETSc Performance Summary: >> ---------------------------------------------- >> >> Mod3DMT.test on a arch-darwin-c-opt named Haos-MBP with 1 processor, by >> donghao Fri Feb 7 10:38:00 2020 >> Using Petsc Release Version 3.12.3, unknown >> >> Max Max/Min Avg Total >> Time (sec): 2.187e+00 1.000 2.187e+00 >> Objects: 1.155e+03 1.000 1.155e+03 >> Flop: 4.311e+09 1.000 4.311e+09 4.311e+09 >> Flop/sec: 1.971e+09 1.000 1.971e+09 1.971e+09 >> MPI Messages: 0.000e+00 0.000 0.000e+00 0.000e+00 >> MPI Message Lengths: 0.000e+00 0.000 0.000e+00 0.000e+00 >> MPI Reductions: 0.000e+00 0.000 >> >> Flop counting convention: 1 flop = 1 real number operation of type >> (multiply/divide/add/subtract) >> e.g., VecAXPY() for real vectors of length N --> >> 2N flop >> and VecAXPY() for complex vectors of length N >> --> 8N flop >> >> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- >> -- Message Lengths -- -- Reductions -- >> Avg %Total Avg %Total Count %Total >> Avg %Total Count %Total >> 0: Main Stage: 2.1870e+00 100.0% 4.3113e+09 100.0% 0.000e+00 0.0% >> 0.000e+00 0.0% 0.000e+00 0.0% >> >> ------------------------------------------------------------------------------------------------------------------------ >> >> ------------------------------------------------------------------------------------------------------------------------ >> Event Count Time (sec) Flop >> --- Global --- --- Stage ---- Total >> Max Ratio Max Ratio Max Ratio Mess AvgLen >> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s >> ------------------------------------------------------------------------------------------------------------------------ >> >> --- Event Stage 0: Main Stage >> >> BuildTwoSidedF 1 1.0 2.2000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatMult 83 1.0 7.8726e-01 1.0 2.08e+09 1.0 0.0e+00 0.0e+00 >> 0.0e+00 36 48 0 0 0 36 48 0 0 0 2644 >> MatSolve 252 1.0 5.5656e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 >> 0.0e+00 25 28 0 0 0 25 28 0 0 0 2144 >> MatLUFactorNum 3 1.0 4.5115e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 >> 0.0e+00 2 1 0 0 0 2 1 0 0 0 526 >> MatILUFactorSym 3 1.0 2.5103e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> MatAssemblyBegin 5 1.0 3.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatAssemblyEnd 5 1.0 1.5709e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> MatGetRowIJ 3 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatCreateSubMats 1 1.0 2.8989e-02 1.0 
0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> MatGetOrdering 3 1.0 1.1200e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> MatView 3 1.0 1.2600e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecDot 82 1.0 8.9328e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 >> 0.0e+00 4 4 0 0 0 4 4 0 0 0 2048 >> VecDotNorm2 41 1.0 9.9019e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 >> 0.0e+00 5 4 0 0 0 5 4 0 0 0 1848 >> VecNorm 43 1.0 3.9988e-02 1.0 9.59e+07 1.0 0.0e+00 0.0e+00 >> 0.0e+00 2 2 0 0 0 2 2 0 0 0 2399 >> VecCopy 2 1.0 1.1150e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecSet 271 1.0 4.2833e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 2 0 0 0 0 2 0 0 0 0 0 >> VecAXPY 1 1.0 5.9200e-04 1.0 2.23e+06 1.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 3769 >> VecAXPBYCZ 82 1.0 1.1448e-01 1.0 3.66e+08 1.0 0.0e+00 0.0e+00 >> 0.0e+00 5 8 0 0 0 5 8 0 0 0 3196 >> VecWAXPY 82 1.0 6.7460e-02 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 >> 0.0e+00 3 4 0 0 0 3 4 0 0 0 2712 >> VecAssemblyBegin 2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecAssemblyEnd 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> VecScatterBegin 84 1.0 5.2800e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >> KSPSetUp 4 1.0 1.4765e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >> KSPSolve 1 1.0 1.8514e+00 1.0 4.31e+09 1.0 0.0e+00 0.0e+00 >> 0.0e+00 85100 0 0 0 85100 0 0 0 2329 >> PCSetUp 4 1.0 1.0193e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 >> 0.0e+00 5 1 0 0 0 5 1 0 0 0 233 >> PCSetUpOnBlocks 1 1.0 7.1421e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 >> 0.0e+00 3 1 0 0 0 3 1 0 0 0 332 >> PCApply 84 1.0 5.7927e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 >> 0.0e+00 26 28 0 0 0 26 28 0 0 0 2060 >> PCApplyOnBlocks 252 1.0 5.7902e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 >> 0.0e+00 26 28 0 0 0 26 28 0 0 0 2061 >> ------------------------------------------------------------------------------------------------------------------------ >> >> >> Cheers, >> Hao >> >> >> >> From: Smith, Barry F. <[email protected]> >> Sent: Thursday, February 6, 2020 7:03 PM >> To: Hao DONG <[email protected]> >> Cc: [email protected] <[email protected]> >> Subject: Re: [petsc-users] What is the right way to implement a (block) >> Diagonal ILU as PC? >> >> >> Read my comments ALL the way down, they go a long way. >> >>> On Feb 6, 2020, at 3:43 AM, Hao DONG <[email protected]> wrote: >>> >>> Dear Hong and Barry, >>> >>> Thanks for the suggestions. So there could be some problems in my PETSc >>> configuration? - but my PETSc lib was indeed compiled without the debug >>> flags (--with-debugging=0). I use GCC/GFortran (Home-brew GCC 9.2.0) for >>> the compiling and building of PETSc and my own fortran code. My Fortran >>> compiling flags are simply like: >>> >>> -O3 -ffree-line-length-none -fastsse >>> >>> Which is also used for -FOPTFLAGS in PETSc (I added -openmp for PETSc, but >>> not my fortran code, as I don’t have any OMP optimizations in my program). >>> Note the performance test results I listed yesterday (e.g. 4.08s with 41 >>> bcgs iterations.) are without any CSR-array->PETSc translation overhead >>> (only include the set and solve part). >> >> PETSc doesn't use -openmp in any way for its solvers. Do not use this >> option, it may be slowing the code down. Please send configure.log >> >>> >>> I have two questions about the performance difference - >>> >>> 1. 
Is ilu only factorized once for each iteration, or ilu is performed at >>> every outer ksp iteration steps? Sounds unlikely - but if so, this could >>> cause some extra overheads. >> >> ILU is ONLY done if the matrix has changed (which seems wrong). >>> >>> 2. Some KSP solvers like BCGS or TFQMR has two “half-iterations” for each >>> iteration step. Not sure how it works in PETSc, but is that possible that >>> both the two “half" relative residuals are counted in the output array, >>> doubling the number of iterations (but that cannot explain the extra time >>> consumed)? >> >> Yes, PETSc might report them as two, you need to check the exact code. >> >>> >>> Anyway, the output with -log_view from the same 278906 by 278906 matrix >>> with 3-block D-ILU in PETSc is as follows: >>> >>> >>> ---------------------------------------------- PETSc Performance Summary: >>> ---------------------------------------------- >>> >>> MEMsolv.lu on a arch-darwin-c-opt named Haos-MBP with 1 processor, by >>> donghao Thu Feb 6 09:07:35 2020 >>> Using Petsc Release Version 3.12.3, unknown >>> >>> Max Max/Min Avg Total >>> Time (sec): 4.443e+00 1.000 4.443e+00 >>> Objects: 1.155e+03 1.000 1.155e+03 >>> Flop: 4.311e+09 1.000 4.311e+09 4.311e+09 >>> Flop/sec: 9.703e+08 1.000 9.703e+08 9.703e+08 >>> MPI Messages: 0.000e+00 0.000 0.000e+00 0.000e+00 >>> MPI Message Lengths: 0.000e+00 0.000 0.000e+00 0.000e+00 >>> MPI Reductions: 0.000e+00 0.000 >>> >>> Flop counting convention: 1 flop = 1 real number operation of type >>> (multiply/divide/add/subtract) >>> e.g., VecAXPY() for real vectors of length N >>> --> 2N flop >>> and VecAXPY() for complex vectors of length N >>> --> 8N flop >>> >>> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- >>> -- Message Lengths -- -- Reductions -- >>> Avg %Total Avg %Total Count %Total >>> Avg %Total Count %Total >>> 0: Main Stage: 4.4435e+00 100.0% 4.3113e+09 100.0% 0.000e+00 0.0% >>> 0.000e+00 0.0% 0.000e+00 0.0% >>> >>> ———————————————————————————————————————————————————————————— >>> See the 'Profiling' chapter of the users' manual for details on >>> interpreting output. >>> Phase summary info: >>> Count: number of times phase was executed >>> Time and Flop: Max - maximum over all processors >>> Ratio - ratio of maximum to minimum over all processors >>> Mess: number of messages sent >>> AvgLen: average message length (bytes) >>> Reduct: number of global reductions >>> Global: entire computation >>> Stage: stages of a computation. Set stages with PetscLogStagePush() and >>> PetscLogStagePop(). 
>>> %T - percent time in this phase %F - percent flop in this >>> phase >>> %M - percent messages in this phase %L - percent message lengths >>> in this phase >>> %R - percent reductions in this phase >>> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over >>> all processors) >>> ------------------------------------------------------------------------------------------------------------------------ >>> Event Count Time (sec) Flop >>> --- Global --- --- Stage ---- Total >>> Max Ratio Max Ratio Max Ratio Mess AvgLen >>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s >>> ------------------------------------------------------------------------------------------------------------------------ >>> >>> --- Event Stage 0: Main Stage >>> >>> BuildTwoSidedF 1 1.0 2.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatMult 83 1.0 1.7815e+00 1.0 2.08e+09 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 40 48 0 0 0 40 48 0 0 0 1168 >>> MatSolve 252 1.0 1.2708e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 29 28 0 0 0 29 28 0 0 0 939 >>> MatLUFactorNum 3 1.0 7.9725e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 2 1 0 0 0 2 1 0 0 0 298 >>> MatILUFactorSym 3 1.0 2.6998e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> MatAssemblyBegin 5 1.0 3.6000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatAssemblyEnd 5 1.0 3.1619e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> MatGetRowIJ 3 1.0 2.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatCreateSubMats 1 1.0 3.9659e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> MatGetOrdering 3 1.0 4.3070e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatView 3 1.0 1.3600e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> VecDot 82 1.0 1.8948e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 4 4 0 0 0 4 4 0 0 0 966 >>> VecDotNorm2 41 1.0 1.6812e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 4 4 0 0 0 4 4 0 0 0 1088 >>> VecNorm 43 1.0 9.5099e-02 1.0 9.59e+07 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 2 2 0 0 0 2 2 0 0 0 1009 >>> VecCopy 2 1.0 1.0120e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> VecSet 271 1.0 3.8922e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> VecAXPY 1 1.0 7.7200e-04 1.0 2.23e+06 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 2890 >>> VecAXPBYCZ 82 1.0 2.4370e-01 1.0 3.66e+08 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 5 8 0 0 0 5 8 0 0 0 1502 >>> VecWAXPY 82 1.0 1.4148e-01 1.0 1.83e+08 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 3 4 0 0 0 3 4 0 0 0 1293 >>> VecAssemblyBegin 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> VecAssemblyEnd 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> VecScatterBegin 84 1.0 5.9300e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> KSPSetUp 4 1.0 1.4167e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> KSPSolve 1 1.0 4.0250e+00 1.0 4.31e+09 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 91100 0 0 0 91100 0 0 0 1071 >>> PCSetUp 4 1.0 1.5207e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 3 1 0 0 0 3 1 0 0 0 156 >>> PCSetUpOnBlocks 1 1.0 1.1116e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 3 1 0 0 0 3 1 0 0 0 214 >>> PCApply 84 1.0 1.2912e+00 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 29 28 0 0 0 29 28 0 0 0 924 >>> PCApplyOnBlocks 252 1.0 1.2909e+00 1.0 
1.19e+09 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 29 28 0 0 0 29 28 0 0 0 924 >>> ------------------------------------------------------------------------------------------------------------------------ >>> >>> # I skipped the memory part - the options (and compiler options) are as >>> follows: >>> >>> #PETSc Option Table entries: >>> -ksp_type bcgs >>> -ksp_view >>> -log_view >>> -pc_bjacobi_local_blocks 3 >>> -pc_factor_levels 0 >>> -pc_sub_type ilu >>> -pc_type bjacobi >>> #End of PETSc Option Table entries >>> Compiled with FORTRAN kernels >>> Compiled with full precision matrices (default) >>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 >>> sizeof(PetscScalar) 16 sizeof(PetscInt) 4 >>> Configure options: --with-scalar-type=complex --download-mumps >>> --download-scalapack --with-fortran-kernels=1 -- FOPTFLAGS=“-O3 >>> -fastsse -mp -openmp” --COPTFLAGS=“-O3 -fastsse -mp -openmp” >>> --CXXOPTFLAGS="-O3 -fastsse -mp -openmp" -- with-debugging=0 >>> ----------------------------------------- >>> Libraries compiled on 2020-02-03 10:44:31 on Haos-MBP >>> Machine characteristics: Darwin-19.2.0-x86_64-i386-64bit >>> Using PETSc directory: /Users/donghao/src/git/PETSc-current >>> Using PETSc arch: arch-darwin-c-opt >>> ----------------------------------------- >>> >>> Using C compiler: mpicc -Wall -Wwrite-strings -Wno-strict-aliasing >>> -Wno-unknown-pragmas -fstack-protector -fno-stack- check >>> -Qunused-arguments -fvisibility=hidden >>> Using Fortran compiler: mpif90 -Wall -ffree-line-length-0 >>> -Wno-unused-dummy-argument >>> >>> Using include paths: -I/Users/donghao/src/git/PETSc-current/include >>> -I/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/include >>> ----------------------------------------- >>> >>> Using C linker: mpicc >>> Using Fortran linker: mpif90 >>> Using libraries: >>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib >>> -L/Users/donghao/src/git/PETSc- current/arch-darwin-c-opt/lib -lpetsc >>> -Wl,-rpath,/Users/donghao/src/git/PETSc-current/arch-darwin-c-opt/lib >>> -L/Users/ donghao/src/git/PETSc-current/arch-darwin-c-opt/lib >>> -Wl,-rpath,/usr/local/opt/libevent/lib -L/usr/local/opt/libevent/ lib >>> -Wl,-rpath,/usr/local/Cellar/open-mpi/4.0.2/lib >>> -L/usr/local/Cellar/open-mpi/4.0.2/lib -Wl,-rpath,/usr/local/Cellar/ >>> gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple-darwin19/9.2.0 >>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9/gcc/x86_64-apple- >>> darwin19/9.2.0 -Wl,-rpath,/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 >>> -L/usr/local/Cellar/gcc/9.2.0_3/lib/gcc/9 -lcmumps - ldmumps -lsmumps >>> -lzmumps -lmumps_common -lpord -lscalapack -llapack -lblas -lc++ -ldl >>> -lmpi_usempif08 - lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi >>> -lgfortran -lquadmath -lm -lc++ -ldl >>> >>> >>> On the other hand, running PETSc with >>> >>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -pc_sub_type lu -ksp_type gmres >>> -ksp_monitor -ksp_view -log_view >>> >>> For the same problem takes 5.37s and 72 GMRES iterations. Our previous >>> testings show that BiCGstab (bcgs in PETSc) is almost always the most >>> effective KSP solver for our non-symmetrical complex system. Strangely, the >>> system is still using ilu instead of lu for sub blocks. 
The output is like: >> >> -sub_pc_type lu >> >>> >>> 0 KSP Residual norm 2.480412407430e+02 >>> 1 KSP Residual norm 8.848059967835e+01 >>> 2 KSP Residual norm 3.415272863261e+01 >>> 3 KSP Residual norm 1.563045190939e+01 >>> 4 KSP Residual norm 6.241296940043e+00 >>> 5 KSP Residual norm 2.739710899854e+00 >>> 6 KSP Residual norm 1.391304148888e+00 >>> 7 KSP Residual norm 7.959262020849e-01 >>> 8 KSP Residual norm 4.828323055231e-01 >>> 9 KSP Residual norm 2.918529739200e-01 >>> 10 KSP Residual norm 1.905508589557e-01 >>> 11 KSP Residual norm 1.291541892702e-01 >>> 12 KSP Residual norm 8.827145774707e-02 >>> 13 KSP Residual norm 6.521331095889e-02 >>> 14 KSP Residual norm 5.095787952595e-02 >>> 15 KSP Residual norm 4.043060387395e-02 >>> 16 KSP Residual norm 3.232590200012e-02 >>> 17 KSP Residual norm 2.593944982216e-02 >>> 18 KSP Residual norm 2.064639483533e-02 >>> 19 KSP Residual norm 1.653916663492e-02 >>> 20 KSP Residual norm 1.334946415452e-02 >>> 21 KSP Residual norm 1.092886880597e-02 >>> 22 KSP Residual norm 8.988004105542e-03 >>> 23 KSP Residual norm 7.466501315240e-03 >>> 24 KSP Residual norm 6.284389135436e-03 >>> 25 KSP Residual norm 5.425231669964e-03 >>> 26 KSP Residual norm 4.766338253084e-03 >>> 27 KSP Residual norm 4.241238878242e-03 >>> 28 KSP Residual norm 3.808113525685e-03 >>> 29 KSP Residual norm 3.449383788116e-03 >>> 30 KSP Residual norm 3.126025526388e-03 >>> 31 KSP Residual norm 2.958328054299e-03 >>> 32 KSP Residual norm 2.802344900403e-03 >>> 33 KSP Residual norm 2.621993580492e-03 >>> 34 KSP Residual norm 2.430066269304e-03 >>> 35 KSP Residual norm 2.259043079597e-03 >>> 36 KSP Residual norm 2.104287972986e-03 >>> 37 KSP Residual norm 1.952916080045e-03 >>> 38 KSP Residual norm 1.804988937999e-03 >>> 39 KSP Residual norm 1.643302117377e-03 >>> 40 KSP Residual norm 1.471661332036e-03 >>> 41 KSP Residual norm 1.286445911163e-03 >>> 42 KSP Residual norm 1.127543025848e-03 >>> 43 KSP Residual norm 9.777148275484e-04 >>> 44 KSP Residual norm 8.293314450006e-04 >>> 45 KSP Residual norm 6.989331136622e-04 >>> 46 KSP Residual norm 5.852307780220e-04 >>> 47 KSP Residual norm 4.926715539762e-04 >>> 48 KSP Residual norm 4.215941372075e-04 >>> 49 KSP Residual norm 3.699489548162e-04 >>> 50 KSP Residual norm 3.293897163533e-04 >>> 51 KSP Residual norm 2.959954542998e-04 >>> 52 KSP Residual norm 2.700193032414e-04 >>> 53 KSP Residual norm 2.461789791204e-04 >>> 54 KSP Residual norm 2.218839085563e-04 >>> 55 KSP Residual norm 1.945154309976e-04 >>> 56 KSP Residual norm 1.661128781744e-04 >>> 57 KSP Residual norm 1.413198766258e-04 >>> 58 KSP Residual norm 1.213984003195e-04 >>> 59 KSP Residual norm 1.044317450754e-04 >>> 60 KSP Residual norm 8.919957502977e-05 >>> 61 KSP Residual norm 8.042584301275e-05 >>> 62 KSP Residual norm 7.292784493581e-05 >>> 63 KSP Residual norm 6.481935501872e-05 >>> 64 KSP Residual norm 5.718564652679e-05 >>> 65 KSP Residual norm 5.072589750116e-05 >>> 66 KSP Residual norm 4.487930741285e-05 >>> 67 KSP Residual norm 3.941040674119e-05 >>> 68 KSP Residual norm 3.492873281291e-05 >>> 69 KSP Residual norm 3.103798339845e-05 >>> 70 KSP Residual norm 2.822943237409e-05 >>> 71 KSP Residual norm 2.610615023776e-05 >>> 72 KSP Residual norm 2.441692671173e-05 >>> KSP Object: 1 MPI processes >>> type: gmres >>> restart=30, using Classical (unmodified) Gram-Schmidt Orthogonalization >>> with no iterative refinement >>> happy breakdown tolerance 1e-30 >>> maximum iterations=150, nonzero initial guess >>> tolerances: relative=1e-07, absolute=1e-50, 
divergence=10000. >>> left preconditioning >>> using PRECONDITIONED norm type for convergence test >>> PC Object: 1 MPI processes >>> type: bjacobi >>> number of blocks = 3 >>> Local solve is same for all blocks, in the following KSP and PC objects: >>> KSP Object: (sub_) 1 MPI processes >>> type: preonly >>> maximum iterations=10000, initial guess is zero >>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >>> left preconditioning >>> using NONE norm type for convergence test >>> PC Object: (sub_) 1 MPI processes >>> type: ilu >>> out-of-place factorization >>> 0 levels of fill >>> tolerance for zero pivot 2.22045e-14 >>> matrix ordering: natural >>> factor fill ratio given 1., needed 1. >>> Factored matrix follows: >>> Mat Object: 1 MPI processes >>> type: seqaij >>> rows=92969, cols=92969 >>> package used to perform factorization: petsc >>> total: nonzeros=638417, allocated nonzeros=638417 >>> total number of mallocs used during MatSetValues calls=0 >>> not using I-node routines >>> linear system matrix = precond matrix: >>> Mat Object: 1 MPI processes >>> type: seqaij >>> rows=92969, cols=92969 >>> total: nonzeros=638417, allocated nonzeros=638417 >>> total number of mallocs used during MatSetValues calls=0 >>> not using I-node routines >>> linear system matrix = precond matrix: >>> Mat Object: 1 MPI processes >>> type: mpiaij >>> rows=278906, cols=278906 >>> total: nonzeros=3274027, allocated nonzeros=3274027 >>> total number of mallocs used during MatSetValues calls=0 >>> not using I-node (on process 0) routines >>> ... >>> ------------------------------------------------------------------------------------------------------------------------ >>> Event Count Time (sec) Flop >>> --- Global --- --- Stage ---- Total >>> Max Ratio Max Ratio Max Ratio Mess AvgLen >>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s >>> ------------------------------------------------------------------------------------------------------------------------ >>> >>> --- Event Stage 0: Main Stage >>> >>> BuildTwoSidedF 1 1.0 2.3000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatMult 75 1.0 1.5812e+00 1.0 1.88e+09 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 28 24 0 0 0 28 24 0 0 0 1189 >>> MatSolve 228 1.0 1.1442e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 20 14 0 0 0 20 14 0 0 0 944 >> >> These flop rates are ok, but not great. Perhaps an older machine. >> >>> MatLUFactorNum 3 1.0 8.1930e-02 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 290 >>> MatILUFactorSym 3 1.0 2.7102e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatAssemblyBegin 5 1.0 3.7000e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatAssemblyEnd 5 1.0 3.1895e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> MatGetRowIJ 3 1.0 2.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatCreateSubMats 1 1.0 4.0904e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 >>> MatGetOrdering 3 1.0 4.2640e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> MatView 3 1.0 1.4400e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> VecMDot 72 1.0 1.1984e+00 1.0 2.25e+09 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 21 28 0 0 0 21 28 0 0 0 1877 >> >> 21 percent of the time in VecMDOT this is huge for s sequential fun. I >> think maybe you are using a terrible OpenMP BLAS? 
>> >> Send configure.log >> >> >>> VecNorm 76 1.0 1.6841e-01 1.0 1.70e+08 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 3 2 0 0 0 3 2 0 0 0 1007 >>> VecScale 75 1.0 1.8241e-02 1.0 8.37e+07 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 1 0 0 0 0 1 0 0 0 4587 >>> VecCopy 3 1.0 1.4970e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> VecSet 276 1.0 9.1970e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 2 0 0 0 0 2 0 0 0 0 0 >>> VecAXPY 6 1.0 3.7450e-03 1.0 1.34e+07 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 3575 >>> VecMAXPY 75 1.0 1.0022e+00 1.0 2.41e+09 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 18 30 0 0 0 18 30 0 0 0 2405 >>> VecAssemblyBegin 2 1.0 1.0000e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> VecAssemblyEnd 2 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> VecScatterBegin 76 1.0 5.5100e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> VecNormalize 75 1.0 1.8462e-01 1.0 2.51e+08 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 3 3 0 0 0 3 3 0 0 0 1360 >>> KSPSetUp 4 1.0 1.1341e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 >>> KSPSolve 1 1.0 5.3123e+00 1.0 7.91e+09 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 93100 0 0 0 93100 0 0 0 1489 >>> KSPGMRESOrthog 72 1.0 2.1316e+00 1.0 4.50e+09 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 37 57 0 0 0 37 57 0 0 0 2110 >>> PCSetUp 4 1.0 1.5531e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 3 0 0 0 0 3 0 0 0 0 153 >>> PCSetUpOnBlocks 1 1.0 1.1343e-01 1.0 2.37e+07 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 2 0 0 0 0 2 0 0 0 0 209 >>> PCApply 76 1.0 1.1671e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 20 14 0 0 0 20 14 0 0 0 925 >>> PCApplyOnBlocks 228 1.0 1.1668e+00 1.0 1.08e+09 1.0 0.0e+00 0.0e+00 >>> 0.0e+00 20 14 0 0 0 20 14 0 0 0 925 >>> ———————————————————————————————————————————————————————————— >>> ... >>> #PETSc Option Table entries: >>> -ksp_monitor >>> -ksp_type gmres >>> -ksp_view >>> -log_view >>> -pc_bjacobi_local_blocks 3 >>> -pc_sub_type lu >>> -pc_type bjacobi >>> #End of PETSc Option Table entries >>> ... >>> >>> Does any of the setup/output ring a bell? >>> >>> BTW, out of curiosity - what is a “I-node” routine? >>> >>> >>> Cheers, >>> Hao >>> >>> >>> From: Smith, Barry F. <[email protected]> >>> Sent: Wednesday, February 5, 2020 9:42 PM >>> To: Hao DONG <[email protected]> >>> Cc: [email protected] <[email protected]> >>> Subject: Re: [petsc-users] What is the right way to implement a (block) >>> Diagonal ILU as PC? >>> >>> >>> >>>> On Feb 5, 2020, at 4:36 AM, Hao DONG <[email protected]> wrote: >>>> >>>> Thanks a lot for your suggestions, Hong and Barry - >>>> >>>> As you suggested, I first tried the LU direct solvers (built-in and MUMPs) >>>> out this morning, which work perfectly, albeit slow. As my problem itself >>>> is a part of a PDE based optimization, the exact solution in the searching >>>> procedure is not necessary (I often set a relative tolerance of 1E-7/1E-8, >>>> etc. for Krylov subspace methods). Also tried BJACOBI with exact LU, the >>>> KSP just converges in one or two iterations, as expected. >>>> >>>> I added -kspview option for the D-ILU test (still with Block Jacobi as >>>> preconditioner and bcgs as KSP solver). The KSPview output from one of the >>>> examples in a toy model looks like: >>>> >>>> KSP Object: 1 MPI processes >>>> type: bcgs >>>> maximum iterations=120, nonzero initial guess >>>> tolerances: relative=1e-07, absolute=1e-50, divergence=10000. 
>>>> left preconditioning >>>> using PRECONDITIONED norm type for convergence test >>>> PC Object: 1 MPI processes >>>> type: bjacobi >>>> number of blocks = 3 >>>> Local solve is same for all blocks, in the following KSP and PC >>>> objects: >>>> KSP Object: (sub_) 1 MPI processes >>>> type: preonly >>>> maximum iterations=10000, initial guess is zero >>>> tolerances: relative=1e-05, absolute=1e-50, divergence=10000. >>>> left preconditioning >>>> using NONE norm type for convergence test >>>> PC Object: (sub_) 1 MPI processes >>>> type: ilu >>>> out-of-place factorization >>>> 0 levels of fill >>>> tolerance for zero pivot 2.22045e-14 >>>> matrix ordering: natural >>>> factor fill ratio given 1., needed 1. >>>> Factored matrix follows: >>>> Mat Object: 1 MPI processes >>>> type: seqaij >>>> rows=11294, cols=11294 >>>> package used to perform factorization: petsc >>>> total: nonzeros=76008, allocated nonzeros=76008 >>>> total number of mallocs used during MatSetValues calls=0 >>>> not using I-node routines >>>> linear system matrix = precond matrix: >>>> Mat Object: 1 MPI processes >>>> type: seqaij >>>> rows=11294, cols=11294 >>>> total: nonzeros=76008, allocated nonzeros=76008 >>>> total number of mallocs used during MatSetValues calls=0 >>>> not using I-node routines >>>> linear system matrix = precond matrix: >>>> Mat Object: 1 MPI processes >>>> type: mpiaij >>>> rows=33880, cols=33880 >>>> total: nonzeros=436968, allocated nonzeros=436968 >>>> total number of mallocs used during MatSetValues calls=0 >>>> not using I-node (on process 0) routines >>>> >>>> do you see something wrong with my setup? >>>> >>>> I also tried a quick performance test with a small 278906 by 278906 matrix >>>> (3850990 nnz) with the following parameters: >>>> >>>> -ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 -pc_sub_type >>>> ilu -ksp_view >>>> >>>> Reducing the relative residual to 1E-7 >>>> >>>> Took 4.08s with 41 bcgs iterations. >>>> >>>> Merely changing the -pc_bjacobi_local_blocks to 6 >>>> >>>> Took 7.02s with 73 bcgs iterations. 9 blocks would further take 9.45s with >>>> 101 bcgs iterations. >>> >>> This is normal. more blocks slower convergence >>>> >>>> As a reference, my home-brew Fortran code solves the same problem (3-block >>>> D-ILU0) in >>>> >>>> 1.84s with 24 bcgs iterations (the bcgs code is also a home-brew one)… >>>> >>> Run the PETSc code with optimization ./configure --with-debugging=0 an >>> run the code with -log_view this will show where the PETSc code is spending >>> the time (send it to use) >>> >>> >>>> >>>> >>>> Well, by saying “use explicit L/U matrix as preconditioner”, I wonder if a >>>> user is allowed to provide his own (separate) L and U Mat for >>>> preconditioning (like how it works in Matlab solvers), e.g. >>>> >>>> x = qmr(A,b,Tol,MaxIter,L,U,x) >>>> >>>> As I already explicitly constructed the L and U matrices in Fortran, it is >>>> not hard to convert them to Mat in Petsc to test Petsc and my Fortran code >>>> head-to-head. In that case, the A, b, x, and L/U are all identical, it >>>> would be easier to see where the problem came from. >>>> >>>> >>> No, we don't provide this kind of support >>> >>> >>>> >>>> BTW, there is another thing I wondered - is there a way to output residual >>>> in unpreconditioned norm? I tried to >>>> >>>> call KSPSetNormType(ksp_local, KSP_NORM_UNPRECONDITIONED, ierr) >>>> >>>> But always get an error that current ksp does not support unpreconditioned >>>> in LEFT/RIGHT (either way). 
Is it possible to do that (output >>>> unpreconditioned residual) in PETSc at all? >>> >>> -ksp_monitor_true_residual You can also run GMRES (and some other >>> methods) with right preconditioning, -ksp_pc_side right then the residual >>> computed is by the algorithm the unpreconditioned residual >>> >>> KSPSetNormType sets the norm used in the algorithm, it generally always >>> has to left or right, only a couple algorithms support unpreconditioned >>> directly. >>> >>> Barry >>> >>> >>>> >>>> Cheers, >>>> Hao >>>> >>>> >>>> From: Smith, Barry F. <[email protected]> >>>> Sent: Tuesday, February 4, 2020 8:27 PM >>>> To: Hao DONG <[email protected]> >>>> Cc: [email protected] <[email protected]> >>>> Subject: Re: [petsc-users] What is the right way to implement a (block) >>>> Diagonal ILU as PC? >>>> >>>> >>>> >>>>> On Feb 4, 2020, at 12:41 PM, Hao DONG <[email protected]> wrote: >>>>> >>>>> Dear all, >>>>> >>>>> >>>>> I have a few questions about the implementation of diagonal ILU PC in >>>>> PETSc. I want to solve a very simple system with KSP (in parallel), the >>>>> nature of the system (finite difference time-harmonic Maxwell) is >>>>> probably not important to the question itself. Long story short, I just >>>>> need to solve a set of equations of Ax = b with a block diagonal system >>>>> matrix, like (not sure if the mono font works): >>>>> >>>>> |X | >>>>> A =| Y | >>>>> | Z| >>>>> >>>>> Note that A is not really block-diagonal, it’s just a multi-diagonal >>>>> matrix determined by the FD mesh, where most elements are close to >>>>> diagonal. So instead of a full ILU decomposition, a D-ILU is a good >>>>> approximation as a preconditioner, and the number of blocks should not >>>>> matter too much: >>>>> >>>>> |Lx | |Ux | >>>>> L = | Ly | and U = | Uy | >>>>> | Lz| | Uz| >>>>> >>>>> Where [Lx, Ux] = ILU0(X), etc. Indeed, the D-ILU preconditioner (with 3N >>>>> blocks) is quite efficient with Krylov subspace methods like BiCGstab or >>>>> QMR in my sequential Fortran/Matlab code. >>>>> >>>>> So like most users, I am looking for a parallel implement with this >>>>> problem in PETSc. After looking through the manual, it seems that the >>>>> most straightforward way to do it is through PCBJACOBI. Not sure I >>>>> understand it right, I just setup a 3-block PCJACOBI and give each of the >>>>> block a KSP with PCILU. Is this supposed to be equivalent to my D-ILU >>>>> preconditioner? My little block of fortran code would look like: >>>>> ... >>>>> call PCBJacobiSetTotalBlocks(pc_local,Ntotal, & >>>>> & isubs,ierr) >>>>> call PCBJacobiSetLocalBlocks(pc_local, Nsub, & >>>>> & isubs(istart:iend),ierr) >>>>> ! set up the block jacobi structure >>>>> call KSPSetup(ksp_local,ierr) >>>>> ! allocate sub ksps >>>>> allocate(ksp_sub(Nsub)) >>>>> call PCBJacobiGetSubKSP(pc_local,Nsub,istart, & >>>>> & ksp_sub,ierr) >>>>> do i=1,Nsub >>>>> call KSPGetPC(ksp_sub(i),pc_sub,ierr) >>>>> !ILU preconditioner >>>>> call PCSetType(pc_sub,ptype,ierr) >>>>> call PCFactorSetLevels(pc_sub,1,ierr) ! use ILU(1) here >>>>> call KSPSetType(ksp_sub(i),KSPPREONLY,ierr)] >>>>> end do >>>>> call KSPSetTolerances(ksp_local,KSSiter%tol,PETSC_DEFAULT_REAL, & >>>>> & PETSC_DEFAULT_REAL,KSSiter%maxit,ierr) >>>>> … >>>> >>>> This code looks essentially right. You should call with -ksp_view to >>>> see exactly what PETSc is using for a solver. 
>>>> >>>>> >>>>> I understand that the parallel performance may not be comparable, so I >>>>> first set up a one-process test (with MPIAij, but all the rows are local >>>>> since there is only one process). The system is solved without any >>>>> problem (identical results within error). But the performance is actually >>>>> a lot worse (code built without debugging flags in performance tests) >>>>> than my own home-brew implementation in Fortran (I wrote my own ILU0 in >>>>> CSR sparse matrix format), which is hard to believe. I suspect the >>>>> difference is from the PC as the PETSc version took much more BiCGstab >>>>> iterations (60-ish vs 100-ish) to converge to the same relative tol. >>>> >>>> PETSc uses GMRES by default with a restart of 30 and left >>>> preconditioning. It could be different exact choices in the solver (which >>>> is why -ksp_view is so useful) can explain the differences in the runs >>>> between your code and PETSc's >>>>> >>>>> This is further confirmed when I change the setup of D-ILU (using 6 or 9 >>>>> blocks instead of 3). While my Fortran/Matlab codes see minimal >>>>> performance difference (<5%) when I play with the D-ILU setup, increasing >>>>> the number of D-ILU blocks from 3 to 9 caused the ksp setup with >>>>> PCBJACOBI to suffer a performance decrease of more than 50% in sequential >>>>> test. >>>> >>>> This is odd, the more blocks the smaller each block so the quicker the >>>> ILU set up should be. You can run various cases with -log_view and send us >>>> the output to see what is happening at each part of the computation in >>>> time. >>>> >>>>> So my implementation IS somewhat different in PETSc. Do I miss something >>>>> in the PCBJACOBI setup? Or do I have some fundamental misunderstanding of >>>>> how PCBJACOBI works in PETSc? >>>> >>>> Probably not. >>>>> >>>>> If this is not the right way to implement a block diagonal ILU as >>>>> (parallel) PC, please kindly point me to the right direction. I searched >>>>> through the mail list to find some answers, only to find a couple of >>>>> similar questions... An example would be nice. >>>> >>>> You approach seems fundamentally right but I cannot be sure of possible >>>> bugs. >>>>> >>>>> On the other hand, does PETSc support a simple way to use explicit L/U >>>>> matrix as a preconditioner? I can import the D-ILU matrix (I already >>>>> converted my A matrix into Mat) constructed in my Fortran code to make a >>>>> better comparison. Or do I have to construct my own PC using PCshell? If >>>>> so, is there a good tutorial/example to learn about how to use PCSHELL >>>>> (in a more advanced sense, like how to setup pc side and transpose)? >>>> >>>> Not sure what you mean by explicit L/U matrix as a preconditioner. As >>>> Hong said, yes you can use a parallel LU from MUMPS or SuperLU_DIST or >>>> Pastix as the solver. You do not need any shell code. You simply need to >>>> set the PCType to lu >>>> >>>> You can also set all this options from the command line and don't need >>>> to write the code specifically. So call KSPSetFromOptions() and then for >>>> example >>>> >>>> -pc_type bjacobi -pc_bjacobi_local_blocks 3 -pc_sub_type ilu (this >>>> last one is applied to each block so you could use -pc_type lu and it >>>> would use lu on each block.) 
>>>> >>>> -ksp_type_none -pc_type lu -pc_factor_mat_solver_type mumps (do >>>> parallel LU with mumps) >>>> >>>> By not hardwiring in the code and just using options you can test out >>>> different cases much quicker >>>> >>>> Use -ksp_view to make sure that is using the solver the way you expect. >>>> >>>> Barry >>>> >>>> >>>> >>>> Barry >>>> >>>>> >>>>> Thanks in advance, >>>>> >>>>> Hao >> >> <configure.log><configure.log> >
