Actually, I was using TR with MUMPS before I tried GAMG and it was working pretty well. The reason I want to switch to GAMG is that I have to increase my system size, and the simulation just takes too long with MUMPS and doesn't scale well.
On Wed, Sep 14, 2016 at 3:09 PM, Harshad Sahasrabudhe <[email protected]> wrote: > Hi Barry, > > Sorry, I had no idea that Newton TR would have anything to do with the > linear solver. I was using TR because it is trustworthy and converges every > time. I tried LS just now with GAMG+GMRES and it converges, so I don't have > the CONVERGED_STEP_LENGTH problem anymore. > > Thanks for you help! > > Harshad > > On Wed, Sep 14, 2016 at 2:51 PM, Barry Smith <[email protected]> wrote: > >> >> Oh, you are using SNESSolve_NEWTONTR ! >> >> Now it all makes sense! The trust region methods impose other >> conditions on the linear solution so it needs to have their own special >> convergence test. In particular it requires that the solution from the >> linear solve be inside the trust region (i.e. not too big) >> >> So, any particular reason you use SNESSolve_NEWTONTR instead of the >> default line search NEWTONLS? In general unless you have a good reason I >> recommend just using the NEWTONLS. >> >> If you really want to use the TR then the "early" return from the >> linear solve is expected, it is controlled by the trust region size and is >> not under user control. >> >> I'm sorry I was so brain dead and did not realize you had be using TR >> or we could have resolved this much sooner. >> >> Barry >> >> >> >> > On Sep 14, 2016, at 1:31 PM, Harshad Sahasrabudhe <[email protected]> >> wrote: >> > >> > Thanks. I now put a watchpoint on >> > >> > watch *( (PetscErrorCode (**)(KSP, PetscInt, PetscReal, >> KSPConvergedReason *, void *)) &(ksp->converged) ) >> > >> > The function pointer changes in the first iteration of SNES. It changed >> at the following place: >> > >> > Old value = >> > (PetscErrorCode (*)(KSP, PetscInt, PetscReal, KSPConvergedReason *, >> void *)) 0x2b54acdd00aa <KSPConvergedDefault(KSP, PetscInt, PetscReal, >> KSPConvergedReason*, void*)> >> > New value = >> > (PetscErrorCode (*)(KSP, PetscInt, PetscReal, KSPConvergedReason *, >> void *)) 0x2b54ad436ce8 <.SNES_TR_KSPConverged_Private >> .SNES_TR_KSPConverged_Private.SNES_TR_KSPConverged_Private(_p_KSP*, int, >> double, KSPConvergedReason*, void*)> >> > KSPSetConvergenceTest (ksp=0x22bf090, converge=0x2b54ad436ce8 >> <SNES_TR_KSPConverged_Private(_p_KSP*, int, double, KSPConvergedReason*, >> void*)>, cctx=0x1c8b3e0, >> > destroy=0x2b54ad437210 <SNES_TR_KSPConverged_Destroy(void*)>) >> > at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build >> -real/src/ksp/ksp/interface/itfunc.c:1768 >> > 1767 ksp->converged = converge; >> > >> > Here's the backtrace: >> > >> > #0 KSPSetConvergenceTest (ksp=0x22bf090, converge=0x2b54ad436ce8 >> <SNES_TR_KSPConverged_Private(_p_KSP*, int, double, KSPConvergedReason*, >> void*)>, cctx=0x1c8b3e0, >> > destroy=0x2b54ad437210 <SNES_TR_KSPConverged_Destroy(void*)>) >> > at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build >> -real/src/ksp/ksp/interface/itfunc.c:1768 >> > #1 0x00002b54ad43865a in SNESSolve_NEWTONTR (snes=0x1d9e490) at >> /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build >> -real/src/snes/impls/tr/tr.c:146 >> > #2 0x00002b54acedab57 in SNESSolve (snes=0x1d9e490, b=0x0, x=0x1923420) >> > at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build >> -real/src/snes/interface/snes.c:3743 >> > #3 0x00002b54abe8b780 in libMesh::PetscNonlinearSolver<double>::solve >> (this=0x19198c0, jac_in=..., x_in=..., r_in=...) at >> src/solvers/petsc_nonlinear_solver.C:714 >> > #4 0x00002b54abefa61d in libMesh::NonlinearImplicitSystem::solve >> (this=0x1910fe0) at src/systems/nonlinear_implicit_system.C:183 >> > #5 0x00002b54a5dcdceb in NonlinearPoisson::execute_solver >> (this=0x100c500) at NonlinearPoisson.cpp:1191 >> > #6 0x00002b54a5da733c in NonlinearPoisson::do_solve (this=0x100c500) >> at NonlinearPoisson.cpp:948 >> > #7 0x00002b54a6423785 in Simulation::solve (this=0x100c500) at >> Simulation.cpp:781 >> > #8 0x00002b54a634826e in Nemo::run_simulations (this=0x63b020) at >> Nemo.cpp:1313 >> > #9 0x0000000000426d0d in main (argc=6, argv=0x7ffcdb910768) at >> main.cpp:447 >> > >> > Thanks, >> > Harshad >> > >> > >> > On Wed, Sep 14, 2016 at 2:07 PM, Barry Smith <[email protected]> >> wrote: >> > >> > Super strange, it should never have switched to the function >> SNES_TR_KSPConverged_Private >> > >> > Fortunately you can use the same technique to track down where the >> function pointer changes. Just watch ksp->converged to see when the >> function pointer gets changed and send back the new stack trace. >> > >> > Barry >> > >> > >> > >> > >> > > On Sep 14, 2016, at 12:39 PM, Harshad Sahasrabudhe < >> [email protected]> wrote: >> > > >> > > Hi Barry, >> > > >> > > I put a watchpoint on *((KSP_CONVERGED_REASON*) &( >> ((_p_KSP*)ksp)->reason )) in gdb. The ksp->reason switched between: >> > > >> > > Old value = KSP_CONVERGED_ITERATING >> > > New value = KSP_CONVERGED_RTOL >> > > 0x00002b143054bef2 in KSPConvergedDefault (ksp=0x23c3090, n=12, >> rnorm=5.3617149831259514e-08, reason=0x23c3310, ctx=0x2446210) >> > > at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build >> -real/src/ksp/ksp/interface/iterativ.c:764 >> > > 764 *reason = KSP_CONVERGED_RTOL; >> > > >> > > and >> > > >> > > Old value = KSP_CONVERGED_RTOL >> > > New value = KSP_CONVERGED_ITERATING >> > > KSPSetUp (ksp=0x23c3090) at /depot/ncn/apps/conte/conte-gc >> c-petsc35-dbg/libs/petsc/build-real/src/ksp/ksp/interface/itfunc.c:226 >> > > 226 if (!((PetscObject)ksp)->type_name) { >> > > >> > > However, after iteration 6, it changed to KSP_CONVERGED_STEP_LENGTH >> > > >> > > Old value = KSP_CONVERGED_ITERATING >> > > New value = KSP_CONVERGED_STEP_LENGTH >> > > SNES_TR_KSPConverged_Private (ksp=0x23c3090, n=1, >> rnorm=0.097733468578376406, reason=0x23c3310, cctx=0x1d8f3e0) >> > > at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build >> -real/src/snes/impls/tr/tr.c:36 >> > > 36 PetscFunctionReturn(0); >> > > >> > > Any ideas why that function was executed? Backtrace when the program >> stopped here: >> > > >> > > #0 SNES_TR_KSPConverged_Private (ksp=0x23c3090, n=1, >> rnorm=0.097733468578376406, reason=0x23c3310, cctx=0x1d8f3e0) >> > > at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build >> -real/src/snes/impls/tr/tr.c:36 >> > > #1 0x00002b14305d3fda in KSPGMRESCycle (itcount=0x7ffdcf2d4ffc, >> ksp=0x23c3090) >> > > at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build >> -real/src/ksp/ksp/impls/gmres/gmres.c:182 >> > > #2 0x00002b14305d4711 in KSPSolve_GMRES (ksp=0x23c3090) at >> /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build >> -real/src/ksp/ksp/impls/gmres/gmres.c:235 >> > > #3 0x00002b1430526a8a in KSPSolve (ksp=0x23c3090, b=0x1a916c0, >> x=0x1d89dc0) >> > > at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build >> -real/src/ksp/ksp/interface/itfunc.c:460 >> > > #4 0x00002b1430bb3905 in SNESSolve_NEWTONTR (snes=0x1ea2490) at >> /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build >> -real/src/snes/impls/tr/tr.c:160 >> > > #5 0x00002b1430655b57 in SNESSolve (snes=0x1ea2490, b=0x0, >> x=0x1a27420) >> > > at /depot/ncn/apps/conte/conte-gcc-petsc35-dbg/libs/petsc/build >> -real/src/snes/interface/snes.c:3743 >> > > #6 0x00002b142f606780 in libMesh::PetscNonlinearSolver<double>::solve >> (this=0x1a1d8c0, jac_in=..., x_in=..., r_in=...) at >> src/solvers/petsc_nonlinear_solver.C:714 >> > > #7 0x00002b142f67561d in libMesh::NonlinearImplicitSystem::solve >> (this=0x1a14fe0) at src/systems/nonlinear_implicit_system.C:183 >> > > #8 0x00002b1429548ceb in NonlinearPoisson::execute_solver >> (this=0x1110500) at NonlinearPoisson.cpp:1191 >> > > #9 0x00002b142952233c in NonlinearPoisson::do_solve (this=0x1110500) >> at NonlinearPoisson.cpp:948 >> > > #10 0x00002b1429b9e785 in Simulation::solve (this=0x1110500) at >> Simulation.cpp:781 >> > > #11 0x00002b1429ac326e in Nemo::run_simulations (this=0x63b020) at >> Nemo.cpp:1313 >> > > #12 0x0000000000426d0d in main (argc=6, argv=0x7ffdcf2d7908) at >> main.cpp:447 >> > > >> > > >> > > Thanks! >> > > Harshad >> > > >> > > On Wed, Sep 14, 2016 at 10:10 AM, Harshad Sahasrabudhe < >> [email protected]> wrote: >> > > I think I found the problem. I configured PETSc with COPTFLAGS=-O3. >> I'll remove that option and try again. >> > > >> > > Thanks! >> > > Harshad >> > > >> > > On Wed, Sep 14, 2016 at 10:06 AM, Harshad Sahasrabudhe < >> [email protected]> wrote: >> > > Hi Barry, >> > > >> > > Thanks for your inputs. I tried to set a watchpoint on >> ((_p_KSP*)ksp)->reason, but gdb says no symbol _p_KSP in context. >> Basically, GDB isn't able to find the PETSc source code. I built PETSc with >> --with-debugging=1 statically and -fPIC, but it seems the libpetsc.a I get >> doesn't contain debugging symbols (checked using objdump -g). How do I get >> PETSc library to have debugging info? >> > > >> > > Thanks, >> > > Harshad >> > > >> > > On Tue, Sep 13, 2016 at 2:47 PM, Barry Smith <[email protected]> >> wrote: >> > > >> > > > On Sep 13, 2016, at 1:01 PM, Harshad Sahasrabudhe < >> [email protected]> wrote: >> > > > >> > > > Hi Barry, >> > > > >> > > > I compiled with mpich configured using --enable-g=meminit to get >> rid of MPI errors in Valgrind. Doing this reduced the number of errors to >> 2. I have attached the Valgrind output. >> > > >> > > This isn't helpful but it seems not to be a memory corruption >> issue :-( >> > > > >> > > > I'm using GAMG+GMRES for in each linear iteration of SNES. The >> linear solver converges with CONVERGED_RTOL for the first 6 iterations and >> with CONVERGED_STEP_LENGTH after that. I'm still very confused about why >> this is happening. Any thoughts/ideas? >> > > >> > > Does this happen on one process? If so I would run in the debugger >> and track the variable to see everyplace the variable is changed, this >> would point to exactly what piece of code is changing the variable to this >> unexpected value. >> > > >> > > For example with lldb one can use watch >> http://lldb.llvm.org/tutorial.html to see each time a variable gets >> changed. Similar thing with gdb. >> > > >> > > The variable to watch is ksp->reason Once you get the hang of >> this it can take just a few minutes to track down the code that is making >> this unexpected value, though I understand if you haven't done it before it >> can be intimidating. >> > > >> > > Barry >> > > >> > > You can do the same thing in parallel (like on two processes) if you >> need to but it is more cumbersome since you need run multiple debuggers. >> You can have PETSc start up multiple debuggers with mpiexec -n 2 ./ex >> -start_in_debugger >> > > >> > > >> > > >> > > >> > > > >> > > > Thanks, >> > > > Harshad >> > > > >> > > > On Thu, Sep 8, 2016 at 11:26 PM, Barry Smith <[email protected]> >> wrote: >> > > > >> > > > Install your MPI with --download-mpich as a PETSc ./configure >> option, this will eliminate all the MPICH valgrind errors. Then send as an >> attachment the resulting valgrind file. >> > > > >> > > > I do not 100 % trust any code that produces such valgrind errors. >> > > > >> > > > Barry >> > > > >> > > > >> > > > >> > > > > On Sep 8, 2016, at 10:12 PM, Harshad Sahasrabudhe < >> [email protected]> wrote: >> > > > > >> > > > > Hi Barry, >> > > > > >> > > > > Thanks for the reply. My code is in C. I ran with Valgrind and >> found many "Conditional jump or move depends on uninitialized value(s)", >> "Invalid read" and "Use of uninitialized value" errors. I think all of them >> are from the libraries I'm using (LibMesh, Boost, MPI, etc.). I'm not >> really sure what I'm looking for in the Valgrind output. At the end of the >> file, I get: >> > > > > >> > > > > ==40223== More than 10000000 total errors detected. I'm not >> reporting any more. >> > > > > ==40223== Final error counts will be inaccurate. Go fix your >> program! >> > > > > ==40223== Rerun with --error-limit=no to disable this cutoff. >> Note >> > > > > ==40223== that errors may occur in your program without prior >> warning from >> > > > > ==40223== Valgrind, because errors are no longer being displayed. >> > > > > >> > > > > Can you give some suggestions on how I should proceed? >> > > > > >> > > > > Thanks, >> > > > > Harshad >> > > > > >> > > > > On Thu, Sep 8, 2016 at 1:59 PM, Barry Smith <[email protected]> >> wrote: >> > > > > >> > > > > This is very odd. CONVERGED_STEP_LENGTH for KSP is very >> specialized and should never occur with GMRES. >> > > > > >> > > > > Can you run with valgrind to make sure there is no memory >> corruption? http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind >> > > > > >> > > > > Is your code fortran or C? >> > > > > >> > > > > Barry >> > > > > >> > > > > > On Sep 8, 2016, at 10:38 AM, Harshad Sahasrabudhe < >> [email protected]> wrote: >> > > > > > >> > > > > > Hi, >> > > > > > >> > > > > > I'm using GAMG + GMRES for my Poisson problem. The solver >> converges with KSP_CONVERGED_STEP_LENGTH at a residual of >> 9.773346857844e-02, which is much higher than what I need (I need a >> tolerance of at least 1E-8). I am not able to figure out which tolerance I >> need to set to avoid convergence due to CONVERGED_STEP_LENGTH. >> > > > > > >> > > > > > Any help is appreciated! Output of -ksp_view and -ksp_monitor: >> > > > > > >> > > > > > 0 KSP Residual norm 3.121347818142e+00 >> > > > > > 1 KSP Residual norm 9.773346857844e-02 >> > > > > > Linear solve converged due to CONVERGED_STEP_LENGTH >> iterations 1 >> > > > > > KSP Object: 1 MPI processes >> > > > > > type: gmres >> > > > > > GMRES: restart=30, using Classical (unmodified) >> Gram-Schmidt Orthogonalization with no iterative refinement >> > > > > > GMRES: happy breakdown tolerance 1e-30 >> > > > > > maximum iterations=10000, initial guess is zero >> > > > > > tolerances: relative=1e-08, absolute=1e-50, divergence=10000 >> > > > > > left preconditioning >> > > > > > using PRECONDITIONED norm type for convergence test >> > > > > > PC Object: 1 MPI processes >> > > > > > type: gamg >> > > > > > MG: type is MULTIPLICATIVE, levels=2 cycles=v >> > > > > > Cycles per PCApply=1 >> > > > > > Using Galerkin computed coarse grid matrices >> > > > > > Coarse grid solver -- level ------------------------------- >> > > > > > KSP Object: (mg_coarse_) 1 MPI processes >> > > > > > type: preonly >> > > > > > maximum iterations=1, initial guess is zero >> > > > > > tolerances: relative=1e-05, absolute=1e-50, >> divergence=10000 >> > > > > > left preconditioning >> > > > > > using NONE norm type for convergence test >> > > > > > PC Object: (mg_coarse_) 1 MPI processes >> > > > > > type: bjacobi >> > > > > > block Jacobi: number of blocks = 1 >> > > > > > Local solve is same for all blocks, in the following >> KSP and PC objects: >> > > > > > KSP Object: (mg_coarse_sub_) 1 MPI >> processes >> > > > > > type: preonly >> > > > > > maximum iterations=1, initial guess is zero >> > > > > > tolerances: relative=1e-05, absolute=1e-50, >> divergence=10000 >> > > > > > left preconditioning >> > > > > > using NONE norm type for convergence test >> > > > > > PC Object: (mg_coarse_sub_) 1 MPI >> processes >> > > > > > type: lu >> > > > > > LU: out-of-place factorization >> > > > > > tolerance for zero pivot 2.22045e-14 >> > > > > > using diagonal shift on blocks to prevent zero >> pivot [INBLOCKS] >> > > > > > matrix ordering: nd >> > > > > > factor fill ratio given 5, needed 1.91048 >> > > > > > Factored matrix follows: >> > > > > > Mat Object: 1 MPI processes >> > > > > > type: seqaij >> > > > > > rows=284, cols=284 >> > > > > > package used to perform factorization: petsc >> > > > > > total: nonzeros=7726, allocated nonzeros=7726 >> > > > > > total number of mallocs used during >> MatSetValues calls =0 >> > > > > > using I-node routines: found 133 nodes, >> limit used is 5 >> > > > > > linear system matrix = precond matrix: >> > > > > > Mat Object: 1 MPI processes >> > > > > > type: seqaij >> > > > > > rows=284, cols=284 >> > > > > > total: nonzeros=4044, allocated nonzeros=4044 >> > > > > > total number of mallocs used during MatSetValues >> calls =0 >> > > > > > not using I-node routines >> > > > > > linear system matrix = precond matrix: >> > > > > > Mat Object: 1 MPI processes >> > > > > > type: seqaij >> > > > > > rows=284, cols=284 >> > > > > > total: nonzeros=4044, allocated nonzeros=4044 >> > > > > > total number of mallocs used during MatSetValues calls >> =0 >> > > > > > not using I-node routines >> > > > > > Down solver (pre-smoother) on level 1 >> ------------------------------- >> > > > > > KSP Object: (mg_levels_1_) 1 MPI processes >> > > > > > type: chebyshev >> > > > > > Chebyshev: eigenvalue estimates: min = 0.195339, max = >> 4.10212 >> > > > > > maximum iterations=2 >> > > > > > tolerances: relative=1e-05, absolute=1e-50, >> divergence=10000 >> > > > > > left preconditioning >> > > > > > using nonzero initial guess >> > > > > > using NONE norm type for convergence test >> > > > > > PC Object: (mg_levels_1_) 1 MPI processes >> > > > > > type: sor >> > > > > > SOR: type = local_symmetric, iterations = 1, local >> iterations = 1, omega = 1 >> > > > > > linear system matrix = precond matrix: >> > > > > > Mat Object: () 1 MPI processes >> > > > > > type: seqaij >> > > > > > rows=9036, cols=9036 >> > > > > > total: nonzeros=192256, allocated nonzeros=192256 >> > > > > > total number of mallocs used during MatSetValues calls >> =0 >> > > > > > not using I-node routines >> > > > > > Up solver (post-smoother) same as down solver (pre-smoother) >> > > > > > linear system matrix = precond matrix: >> > > > > > Mat Object: () 1 MPI processes >> > > > > > type: seqaij >> > > > > > rows=9036, cols=9036 >> > > > > > total: nonzeros=192256, allocated nonzeros=192256 >> > > > > > total number of mallocs used during MatSetValues calls =0 >> > > > > > not using I-node routines >> > > > > > >> > > > > > Thanks, >> > > > > > Harshad >> > > > > >> > > > > >> > > > >> > > > >> > > > <valgrind.log.33199> >> > > >> > > >> > > >> > > >> > >> > >> >> >
