$ python
Python 2.7.13 (default, Dec 18 2016, 07:03:39) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 3.6272e+03 - 2.0894e+03   # MatMult, PETSc-3.7 minus PETSc-3.4 (seconds)
1537.7999999999997
>>> 3.6329e+03 - 2.0949e+03   # KSPSolve, PETSc-3.7 minus PETSc-3.4 (seconds)
1538.0
>>> 

You are right, all the extra time is within MatMult(), so for some reason
your shell MatMult is much slower. I cannot guess why without seeing what
your shell MatMult is actually doing inside.
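
For reference, a shell MatMult is usually registered as in the rough sketch
below (UserShellMult, UserCtx, and the event name are placeholders, not taken
from this thread); wrapping its body in a PetscLogEvent makes the user code
show up as its own line in -log_summary, separately from the MatMult entry:

#include <petscksp.h>

typedef struct {               /* placeholder context; the real one holds  */
  Vec work;                    /* whatever data the actual operator needs  */
} UserCtx;

static PetscLogEvent USER_MULT;

/* user-defined MatMult for the shell operator: y = A*x */
static PetscErrorCode UserShellMult(Mat A, Vec x, Vec y)
{
  UserCtx        *ctx;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatShellGetContext(A, &ctx);CHKERRQ(ierr);
  ierr = PetscLogEventBegin(USER_MULT, 0, 0, 0, 0);CHKERRQ(ierr);
  /* ... apply the actual operator here, using ctx ... */
  ierr = PetscLogEventEnd(USER_MULT, 0, 0, 0, 0);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

/* called once during setup: create the shell matrix and register the mult */
static PetscErrorCode CreateShellOperator(MPI_Comm comm, PetscInt m, PetscInt n,
                                          PetscInt M, PetscInt N,
                                          UserCtx *uc, Mat *Ashell)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PetscLogEventRegister("UserShellMult", MAT_CLASSID, &USER_MULT);CHKERRQ(ierr);
  ierr = MatCreateShell(comm, m, n, M, N, (void *)uc, Ashell);CHKERRQ(ierr);
  ierr = MatShellSetOperation(*Ashell, MATOP_MULT, (void (*)(void))UserShellMult);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

With such an event in place, the two -log_summary outputs show directly whether
the extra time is spent in the user code itself or in PETSc's handling of the
shell matrix.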


Make sure your configure options are identical and that you are using the same compiler.

  Barry





> On Sep 15, 2017, at 5:08 AM, Federico Golfrè Andreasi 
> <[email protected]> wrote:
> 
> Hi Barry,
> 
> I have attached an extract of our program output for both versions:
> PETSc-3.4.4 and PETSc-3.7.3.
> 
> In this program the KSP has a shell matrix as operator and an MPIAIJ matrix
> as preconditioner.
> I was wondering whether the slowdown is related to the operations done in
> the MatMult of the shell matrix, because in a test program where I solve a
> similar system without a shell matrix I do not see the performance
> degradation.
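> 
> For reference, that operator/preconditioner pairing boils down to something
> like the sketch below (names are illustrative only; the two-argument
> KSPSetOperators is the PETSc-3.5+ form, while the 3.4 form also took a
> MatStructure flag):
> 
> #include <petscksp.h>
> 
> /* solve A x = b, applying A through the shell matrix and building the
>    preconditioner from the assembled MPIAIJ matrix Pmat                  */
> static PetscErrorCode SolveWithShell(MPI_Comm comm, Mat Ashell, Mat Pmat,
>                                      Vec b, Vec x)
> {
>   KSP            ksp;
>   PetscErrorCode ierr;
> 
>   PetscFunctionBeginUser;
>   ierr = KSPCreate(comm, &ksp);CHKERRQ(ierr);
>   ierr = KSPSetOperators(ksp, Ashell, Pmat);CHKERRQ(ierr);
>   /* PETSc-3.4: KSPSetOperators(ksp, Ashell, Pmat, SAME_NONZERO_PATTERN); */
>   ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
>   ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
>   ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
>   PetscFunctionReturn(0);
> }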
> 
> Perhaps you could give me some hints,
> Thank you and best regards,
> Federico
> 
> 
> 
> 
> On 13 September 2017 at 17:58, Barry Smith <[email protected]> wrote:
> 
> > On Sep 13, 2017, at 10:56 AM, Federico Golfrè Andreasi 
> > <[email protected]> wrote:
> >
> > Hi Barry,
> >
> > I understand and perfectly agree with you that the behavior changes from
> > release to release due to better tuning.
> >
> > In my case, the difference in the solution is negligible, but the runtime
> > increases by up to 70% (with the same number of KSP iterations).
> 
>   Ok this is an important (and bad) difference.
> 
> > So I was wondering if maybe there were just some flags related to memory
> > preallocation or reuse of intermediate solutions that were previously
> > enabled by default.
> 
>    It is not likely to be this.
> 
>    Are both compiled with the same level of compiler optimization?
> 
>    Please run both with -log_summary and send the output, this will tell us 
> WHAT parts are now slower.
> 
>   Barry
> 
> >
> > Thank you,
> > Federico
> >
> >
> >
> > On 13 September 2017 at 17:29, Barry Smith <[email protected]> wrote:
> >
> >    There will likely always be slight differences in convergence over that 
> > many releases. Lots of little defaults etc get changed over time as we 
> > learn from users and increase the robustness of the defaults.
> >
> >     So in your case do the differences matter?
> >
> > 1) What is the time to solution in both cases? Is it a few percent
> > different, or is it now much slower?
> >
> > 2) What about the number of iterations? Almost identical (say 1 or 2
> > different), or does it now take 30 iterations when it used to take 5?
> >
> >   Barry
> >
> > > On Sep 13, 2017, at 10:25 AM, Federico Golfrè Andreasi 
> > > <[email protected]> wrote:
> > >
> > > Dear PETSc users/developers,
> > >
> > > I recently switched from PETSc-3.4 to PETSc-3.7 and found that some
> > > default settings for the "mg" (multigrid) preconditioner have changed.
> > >
> > > We were solving a linear system passing, through the command line, the
> > > following options:
> > > -ksp_type      fgmres
> > > -ksp_max_it    100000
> > > -ksp_rtol      0.000001
> > > -pc_type       mg
> > > -ksp_view
> > >
> > > The output of the KSP view is as follows:
> > >
> > > KSP Object: 128 MPI processes
> > >   type: fgmres
> > >     GMRES: restart=30, using Classical (unmodified) Gram-Schmidt 
> > > Orthogonalization with no iterative refinement
> > >     GMRES: happy breakdown tolerance 1e-30
> > >   maximum iterations=100000, initial guess is zero
> > >   tolerances:  relative=1e-06, absolute=1e-50, divergence=10000
> > >   right preconditioning
> > >   using UNPRECONDITIONED norm type for convergence test
> > > PC Object: 128 MPI processes
> > >   type: mg
> > >     MG: type is MULTIPLICATIVE, levels=1 cycles=v
> > >       Cycles per PCApply=1
> > >       Not using Galerkin computed coarse grid matrices
> > >   Coarse grid solver -- level -------------------------------
> > >     KSP Object:    (mg_levels_0_)     128 MPI processes
> > >       type: chebyshev
> > >         Chebyshev: eigenvalue estimates:  min = 0.223549, max = 2.45903
> > >         Chebyshev: estimated using:  [0 0.1; 0 1.1]
> > >         KSP Object:        (mg_levels_0_est_)         128 MPI processes
> > >           type: gmres
> > >             GMRES: restart=30, using Classical (unmodified) Gram-Schmidt 
> > > Orthogonalization with no iterative refinement
> > >             GMRES: happy breakdown tolerance 1e-30
> > >           maximum iterations=10, initial guess is zero
> > >           tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
> > >           left preconditioning
> > >           using NONE norm type for convergence test
> > >         PC Object:        (mg_levels_0_)         128 MPI processes
> > >           type: sor
> > >             SOR: type = local_symmetric, iterations = 1, local iterations 
> > > = 1, omega = 1
> > >           linear system matrix followed by preconditioner matrix:
> > >           Matrix Object:           128 MPI processes
> > >             type: mpiaij
> > >             rows=279669, cols=279669
> > >             total: nonzeros=6427943, allocated nonzeros=6427943
> > >             total number of mallocs used during MatSetValues calls =0
> > >               not using I-node (on process 0) routines
> > >           Matrix Object:           128 MPI processes
> > >             type: mpiaij
> > >             rows=279669, cols=279669
> > >             total: nonzeros=6427943, allocated nonzeros=6427943
> > >             total number of mallocs used during MatSetValues calls =0
> > >               not using I-node (on process 0) routines
> > >       maximum iterations=1, initial guess is zero
> > >       tolerances:  relative=1e-05, absolute=1e-50, divergence=10000
> > >       left preconditioning
> > >       using NONE norm type for convergence test
> > >     PC Object:    (mg_levels_0_)     128 MPI processes
> > >       type: sor
> > >         SOR: type = local_symmetric, iterations = 1, local iterations = 
> > > 1, omega = 1
> > >       linear system matrix followed by preconditioner matrix:
> > >       Matrix Object:       128 MPI processes
> > >         type: mpiaij
> > >         rows=279669, cols=279669
> > >         total: nonzeros=6427943, allocated nonzeros=6427943
> > >         total number of mallocs used during MatSetValues calls =0
> > >           not using I-node (on process 0) routines
> > >       Matrix Object:       128 MPI processes
> > >         type: mpiaij
> > >         rows=279669, cols=279669
> > >         total: nonzeros=6427943, allocated nonzeros=6427943
> > >         total number of mallocs used during MatSetValues calls =0
> > >           not using I-node (on process 0) routines
> > >   linear system matrix followed by preconditioner matrix:
> > >   Matrix Object:   128 MPI processes
> > >     type: mpiaij
> > >     rows=279669, cols=279669
> > >     total: nonzeros=6427943, allocated nonzeros=6427943
> > >     total number of mallocs used during MatSetValues calls =0
> > >       not using I-node (on process 0) routines
> > >   Matrix Object:   128 MPI processes
> > >     type: mpiaij
> > >     rows=279669, cols=279669
> > >     total: nonzeros=6427943, allocated nonzeros=6427943
> > >     total number of mallocs used during MatSetValues calls =0
> > >       not using I-node (on process 0) routines
> > >
> > > When I build the same program using PETSc-3.7 and run it with the same 
> > > options we observe that the runtime increases and the convergence is 
> > > slightly different. The output of the KSP view is:
> > >
> > > KSP Object: 128 MPI processes
> > >   type: fgmres
> > >     GMRES: restart=30, using Classical (unmodified) Gram-Schmidt 
> > > Orthogonalization with no iterative refinement
> > >     GMRES: happy breakdown tolerance 1e-30
> > >   maximum iterations=100000, initial guess is zero
> > >   tolerances:  relative=1e-06, absolute=1e-50, divergence=10000.
> > >   right preconditioning
> > >   using UNPRECONDITIONED norm type for convergence test
> > > PC Object: 128 MPI processes
> > >   type: mg
> > >     MG: type is MULTIPLICATIVE, levels=1 cycles=v
> > >       Cycles per PCApply=1
> > >       Not using Galerkin computed coarse grid matrices
> > >   Coarse grid solver -- level -------------------------------
> > >     KSP Object:    (mg_levels_0_)     128 MPI processes
> > >       type: chebyshev
> > >         Chebyshev: eigenvalue estimates:  min = 0.223549, max = 2.45903
> > >         Chebyshev: eigenvalues estimated using gmres with translations  
> > > [0. 0.1; 0. 1.1]
> > >         KSP Object:        (mg_levels_0_esteig_)         128 MPI processes
> > >           type: gmres
> > >             GMRES: restart=30, using Classical (unmodified) Gram-Schmidt 
> > > Orthogonalization with no iterative refinement
> > >             GMRES: happy breakdown tolerance 1e-30
> > >           maximum iterations=10, initial guess is zero
> > >           tolerances:  relative=1e-12, absolute=1e-50, divergence=10000.
> > >           left preconditioning
> > >           using PRECONDITIONED norm type for convergence test
> > >       maximum iterations=2, initial guess is zero
> > >       tolerances:  relative=1e-05, absolute=1e-50, divergence=10000.
> > >       left preconditioning
> > >       using NONE norm type for convergence test
> > >     PC Object:    (mg_levels_0_)     128 MPI processes
> > >       type: sor
> > >         SOR: type = local_symmetric, iterations = 1, local iterations = 
> > > 1, omega = 1.
> > >       linear system matrix followed by preconditioner matrix:
> > >       Mat Object:       128 MPI processes
> > >         type: mpiaij
> > >         rows=279669, cols=279669
> > >         total: nonzeros=6427943, allocated nonzeros=6427943
> > >         total number of mallocs used during MatSetValues calls =0
> > >           not using I-node (on process 0) routines
> > >       Mat Object:       128 MPI processes
> > >         type: mpiaij
> > >         rows=279669, cols=279669
> > >         total: nonzeros=6427943, allocated nonzeros=6427943
> > >         total number of mallocs used during MatSetValues calls =0
> > >           not using I-node (on process 0) routines
> > >   linear system matrix followed by preconditioner matrix:
> > >   Mat Object:   128 MPI processes
> > >     type: mpiaij
> > >     rows=279669, cols=279669
> > >     total: nonzeros=6427943, allocated nonzeros=6427943
> > >     total number of mallocs used during MatSetValues calls =0
> > >       not using I-node (on process 0) routines
> > >   Mat Object:   128 MPI processes
> > >     type: mpiaij
> > >     rows=279669, cols=279669
> > >     total: nonzeros=6427943, allocated nonzeros=6427943
> > >     total number of mallocs used during MatSetValues calls =0
> > >       not using I-node (on process 0) routines
> > >
> > > I was able to get a closer solution by adding the following options:
> > > -mg_levels_0_esteig_ksp_norm_type   none
> > > -mg_levels_0_esteig_ksp_rtol        1.0e-5
> > > -mg_levels_ksp_max_it               1
> > >
> > > But I still cannot reach the same runtime we were observing with
> > > PETSc-3.4; could you please advise me whether I should specify any other
> > > options?
> > >
> > > Thank you very much for your support,
> > > Federico Golfre' Andreasi
> > >
> >
> >
> 
> 
> <run_petsc34.txt><run_petsc37.txt>
