Thanks a lot for your suggestions, Hong and Barry -

As you suggested, I first tried the LU direct solvers (built-in and MUMPS) this 
morning. They work perfectly, albeit slowly. Since my problem is part of a 
PDE-based optimization, an exact solution during the search procedure is not 
necessary (I usually set a relative tolerance of 1E-7 or 1E-8 for the Krylov 
subspace methods). I also tried BJACOBI with an exact LU on each block, and the 
KSP converges in one or two iterations, as expected.
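
For reference, the option sets for those tests were roughly the following 
(paraphrased from memory, not my exact run scripts; the built-in LU run simply 
dropped the mat solver type option):

-ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps
-ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type lu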

I added the -ksp_view option for the D-ILU test (still with block Jacobi as the 
preconditioner and bcgs as the KSP solver). The -ksp_view output from one of 
the toy-model examples looks like:

KSP Object: 1 MPI processes
   type: bcgs
   maximum iterations=120, nonzero initial guess
   tolerances:  relative=1e-07, absolute=1e-50, divergence=10000.
   left preconditioning
   using PRECONDITIONED norm type for convergence test
 PC Object: 1 MPI processes
   type: bjacobi
     number of blocks = 3
     Local solve is same for all blocks, in the following KSP and PC objects:
     KSP Object: (sub_) 1 MPI processes
       type: preonly
       maximum iterations=10000, initial guess is zero
       tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
       left preconditioning
       using NONE norm type for convergence test
     PC Object: (sub_) 1 MPI processes
       type: ilu
         out-of-place factorization
         0 levels of fill
         tolerance for zero pivot 2.22045e-14
         matrix ordering: natural
         factor fill ratio given 1., needed 1.
           Factored matrix follows:
             Mat Object: 1 MPI processes
               type: seqaij
               rows=11294, cols=11294
               package used to perform factorization: petsc
               total: nonzeros=76008, allocated nonzeros=76008
               total number of mallocs used during MatSetValues calls=0
                 not using I-node routines
       linear system matrix = precond matrix:
       Mat Object: 1 MPI processes
         type: seqaij
         rows=11294, cols=11294
         total: nonzeros=76008, allocated nonzeros=76008
         total number of mallocs used during MatSetValues calls=0
           not using I-node routines
   linear system matrix = precond matrix:
   Mat Object: 1 MPI processes
     type: mpiaij
     rows=33880, cols=33880
     total: nonzeros=436968, allocated nonzeros=436968
     total number of mallocs used during MatSetValues calls=0
       not using I-node (on process 0) routines

Do you see anything wrong with my setup?

I also ran a quick performance test on a small 278906 by 278906 matrix 
(3850990 nnz) with the following parameters:

-ksp_type bcgs -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu 
-ksp_view

Converging the relative residual to 1E-7 took 4.08 s with 41 bcgs iterations.

Merely changing -pc_bjacobi_local_blocks to 6 increased that to 7.02 s with 73 
bcgs iterations, and 9 blocks took 9.45 s with 101 bcgs iterations.

As a reference, my home-brew Fortran code solves the same problem (3-block 
D-ILU0) in 1.84 s with 24 bcgs iterations (the bcgs code is also home-brew)…



Well, by saying “use explicit L/U matrix as preconditioner”, I was wondering 
whether a user is allowed to provide his own (separate) L and U Mat objects for 
preconditioning (the way the Matlab solvers do), e.g.

x = qmr(A,b,Tol,MaxIter,L,U,x)

As I have already explicitly constructed the L and U matrices in Fortran, it is 
not hard to convert them to Mat objects in PETSc and test PETSc against my 
Fortran code head-to-head. In that case A, b, x, and L/U would all be 
identical, which would make it easier to see where the difference comes from.
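
If PCSHELL turns out to be the way to do that, I imagine the plumbing would be 
roughly like the sketch below (just a sketch: MyDILUSolve stands for my own CSR 
forward/backward substitution routine, not anything in PETSc):

      ! attach my own apply routine to a shell PC
      external MyPCApply
      call KSPGetPC(ksp_local,pc_local,ierr)
      call PCSetType(pc_local,PCSHELL,ierr)
      call PCShellSetApply(pc_local,MyPCApply,ierr)

      ! apply routine: compute y = U^{-1} ( L^{-1} x )
      subroutine MyPCApply(pc,x,y,ierr)
#include "petsc/finclude/petscksp.h"
      use petscksp
      implicit none
      PC                   :: pc
      Vec                  :: x, y
      PetscErrorCode       :: ierr
      PetscScalar, pointer :: xx(:), yy(:)
      call VecGetArrayReadF90(x,xx,ierr)
      call VecGetArrayF90(y,yy,ierr)
      call MyDILUSolve(xx,yy)   ! my own D-ILU triangular solves on the arrays
      call VecRestoreArrayReadF90(x,xx,ierr)
      call VecRestoreArrayF90(y,yy,ierr)
      end subroutine MyPCApply

PCShellSetSetUp could hold the factorization step as well, but since my factors 
are already built, only the apply routine should be needed.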



BTW, there is another thing I have been wondering about: is there a way to 
output the residual in the unpreconditioned norm? I tried

call KSPSetNormType(ksp_local, KSP_NORM_UNPRECONDITIONED, ierr)

but I always get an error saying that the current KSP does not support the 
unpreconditioned norm with LEFT/RIGHT preconditioning (either way). Is it 
possible to output the unpreconditioned residual in PETSc at all?

Cheers,
Hao


________________________________
From: Smith, Barry F. <[email protected]>
Sent: Tuesday, February 4, 2020 8:27 PM
To: Hao DONG <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: [petsc-users] What is the right way to implement a (block) 
Diagonal ILU as PC?



> On Feb 4, 2020, at 12:41 PM, Hao DONG <[email protected]> wrote:
>
> Dear all,
>
>
> I have a few questions about implementing a diagonal ILU PC in PETSc. I want 
> to solve a very simple system with KSP (in parallel); the nature of the 
> system (finite-difference time-harmonic Maxwell) is probably not important to 
> the question itself. Long story short, I just need to solve a set of 
> equations Ax = b with a block diagonal system matrix, like (not sure if the 
> mono font works):
>
>    |X    |
> A =|  Y  |
>    |    Z|
>
> Note that A is not really block-diagonal; it is a multi-diagonal matrix 
> determined by the FD mesh, with most elements close to the diagonal. So 
> instead of a full ILU decomposition, a D-ILU is a good approximate 
> preconditioner, and the number of blocks should not matter too much:
>
>     |Lx      |         |Ux      |
> L = |   Ly   | and U = |   Uy   |
>     |      Lz|         |      Uz|
>
> Where [Lx, Ux] = ILU0(X), etc. Indeed, the D-ILU preconditioner (with 3N 
> blocks) is quite efficient with Krylov subspace methods like BiCGstab or QMR 
> in my sequential Fortran/Matlab code.
>
> So like most users, I am looking for a parallel implementation of this in 
> PETSc. After looking through the manual, it seems that the most 
> straightforward way to do it is through PCBJACOBI. Not sure whether I 
> understand it right, but I set up a 3-block PCBJACOBI and gave each block a 
> KSP with PCILU. Is this supposed to be equivalent to my D-ILU preconditioner? 
> My little block of Fortran code looks like:
> ...
>       call PCBJacobiSetTotalBlocks(pc_local,Ntotal,                   &
>      &     isubs,ierr)
>       call PCBJacobiSetLocalBlocks(pc_local, Nsub,                    &
>      &    isubs(istart:iend),ierr)
>       ! set up the block jacobi structure
>       call KSPSetup(ksp_local,ierr)
>       ! allocate sub ksps
>       allocate(ksp_sub(Nsub))
>       call PCBJacobiGetSubKSP(pc_local,Nsub,istart,                   &
>      &     ksp_sub,ierr)
>       do i=1,Nsub
>           call KSPGetPC(ksp_sub(i),pc_sub,ierr)
>           !ILU preconditioner
>           call PCSetType(pc_sub,ptype,ierr)
>           call PCFactorSetLevels(pc_sub,1,ierr) ! use ILU(1) here
>           call KSPSetType(ksp_sub(i),KSPPREONLY,ierr)
>       end do
>       call KSPSetTolerances(ksp_local,KSSiter%tol,PETSC_DEFAULT_REAL, &
>      &     PETSC_DEFAULT_REAL,KSSiter%maxit,ierr)
> …

     This code looks essentially right. You should run with -ksp_view to see 
exactly what solver PETSc is using.

>
> I understand that the parallel performance may not be comparable, so I first 
> set up a one-process test (with MPIAIJ, but all the rows are local since 
> there is only one process). The system is solved without any problem 
> (identical results within error). But the performance is actually a lot worse 
> (code built without debugging flags for the performance tests) than my own 
> home-brew implementation in Fortran (I wrote my own ILU0 in CSR sparse matrix 
> format), which is hard to believe. I suspect the difference comes from the 
> PC, as the PETSc version took many more BiCGstab iterations (100-ish vs 
> 60-ish) to converge to the same relative tolerance.

   PETSc uses GMRES by default with a restart of 30 and left preconditioning. 
Different choices in the solver (which is why -ksp_view is so useful) could 
explain the differences between your code's runs and PETSc's.
>
> This is further confirmed when I change the setup of the D-ILU (using 6 or 9 
> blocks instead of 3). While my Fortran/Matlab codes see a minimal performance 
> difference (<5%) when I play with the D-ILU setup, increasing the number of 
> D-ILU blocks from 3 to 9 caused the KSP setup with PCBJACOBI to suffer a 
> performance decrease of more than 50% in the sequential test.

   This is odd; the more blocks, the smaller each block, so the quicker the ILU 
setup should be. You can run the various cases with -log_view and send us the 
output so we can see where the time is spent in each part of the computation.

> So my implementation IS somehow different in PETSc. Am I missing something in 
> the PCBJACOBI setup? Or do I have some fundamental misunderstanding of how 
> PCBJACOBI works in PETSc?

   Probably not.
>
> If this is not the right way to implement a block diagonal ILU as a 
> (parallel) PC, please kindly point me in the right direction. I searched 
> through the mailing list for answers, only to find a couple of similar 
> questions... An example would be nice.

   Your approach seems fundamentally right, but I cannot rule out possible bugs.
>
> On the other hand, does PETSc support a simple way to use an explicit L/U 
> matrix as a preconditioner? I could import the D-ILU factors constructed in 
> my Fortran code (I have already converted my A matrix into a Mat) to make a 
> better comparison. Or do I have to construct my own PC using PCSHELL? If so, 
> is there a good tutorial/example on how to use PCSHELL (in a more advanced 
> sense, like how to set up the PC side and the transpose apply)?

   Not sure what you mean by an explicit L/U matrix as a preconditioner. As Hong 
said, yes, you can use a parallel LU from MUMPS, SuperLU_DIST, or PaStiX as the 
solver. You do not need any shell code; you simply need to set the PC type to lu.
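
   In Fortran that is just a couple of calls, something along these lines (a 
sketch, error checking omitted):

      call KSPGetPC(ksp_local,pc_local,ierr)
      call PCSetType(pc_local,PCLU,ierr)
      call PCFactorSetMatSolverType(pc_local,MATSOLVERMUMPS,ierr)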

   You can also set all of these options from the command line so you don't 
need to hardwire them in the code. Call KSPSetFromOptions() and then, for 
example,

    -pc_type bjacobi -pc_bjacobi_local_blocks 3 -sub_pc_type ilu  (this last 
option is applied to each block, so you could use -sub_pc_type lu and it would 
use LU on each block.)

   -ksp_type preonly -pc_type lu -pc_factor_mat_solver_type mumps  (do parallel 
LU with MUMPS)
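
   The code side for this path only needs something like the following (a 
sketch; ksp_local is assumed to already have its operators set, and b and x 
stand for your right hand side and solution vectors):

      call KSPSetFromOptions(ksp_local,ierr)
      call KSPSolve(ksp_local,b,x,ierr)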

By not hardwiring the choices in the code and just using options, you can test 
out different cases much more quickly.

Use -ksp_view to make sure it is using the solvers the way you expect.

Barry

>
> Thanks in advance,
>
> Hao
