Johannes, I now see that only the buildbots running the old OpenMPI
1.4.3 (the same version I have) do not run the regression tests in
parallel. Could that be the reason? Was
  mpirun -n 3 demo_navier-stokes
failing with the PETSc / Hypre BoomerAMG error and deadlocking?

Jan


On Thu, 30 May 2013 15:47:12 +0200
Jan Blechta <[email protected]> wrote:
> I observed that BoomerAMG eventually fails when running on 3, 5, 6 or
> 7 processes. With 1, 2, 4 or 8 processes it is fine. Strangely enough,
> nobody has seen it but me, even though I can reproduce it very easily:
>   np=3  # or 5, 6, 7
>   export DOLFIN_NOPLOT=1
>   mpirun -n $np demo_navier-stokes
> with FEniCS 1.0.0 + PETSc 3.2, and with FEniCS dev + PETSc 3.4. After
> a few timesteps PETSc fails and DOLFIN deadlocks.
> 
> PETSc throws the error in this demo when solving the projection step,
> i.e. a Poisson problem with both Dirichlet and zero Neumann
> conditions, discretized by piecewise linears on triangles.
> 
> Regarding the effort to reproduce it with PETSc directly, Jed: I was
> able to dump this specific matrix to the binary format but not the
> vector, so I somehow need to obtain a binary vector - is there
> documentation of that binary format anywhere?
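> 
> In the meantime, here is the kind of thing I have in mind - a minimal
> sketch assuming I can get at the underlying PETSc Vec (DOLFIN's
> PETScVector should expose it, though the accessor name may differ
> between versions), using VecView with a binary viewer:
> 
>   /* Hypothetical helper: dump a Vec in PETSc's native binary format.
>      'b' is assumed to be the assembled RHS obtained from DOLFIN. */
>   #include <petscvec.h>
> 
>   PetscErrorCode dump_vec(Vec b)
>   {
>     PetscViewer    viewer;
>     PetscErrorCode ierr;
> 
>     /* Collective: all ranks must call this with the same filename. */
>     ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "rhs.dat",
>                                  FILE_MODE_WRITE, &viewer);CHKERRQ(ierr);
>     ierr = VecView(b, viewer);CHKERRQ(ierr);  /* write the vector */
>     ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
>     return 0;
>   }
> 
> If I understand correctly, reading it back is the mirror image with
> FILE_MODE_READ and VecLoad(), so the on-disk format should never need
> to be decoded by hand.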
> 
> I guess I would need to recompile PETSc in some debug mode to break
> into Hypre - is that so? This is the backtrace from the process
> printing the PETSc ERROR:
> __________________________________________________________________________
> #0  0x00007ffff5caa2d8 in __GI___poll (fds=0x6d02c0, nfds=6,
>     timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
> #1  0x00007fffed0c5ab0 in ?? () from /usr/lib/libopen-pal.so.0
> #2  0x00007fffed0c48ff in ?? () from /usr/lib/libopen-pal.so.0
> #3  0x00007fffed0b9221 in opal_progress () from /usr/lib/libopen-pal.so.0
> #4  0x00007ffff1b593d5 in ?? () from /usr/lib/libmpi.so.0
> #5  0x00007ffff1b8a1c5 in PMPI_Waitany () from /usr/lib/libmpi.so.0
> #6  0x00007ffff2f5c43e in VecScatterEnd_1 ()
>    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #7  0x00007ffff2f57811 in VecScatterEnd ()
>    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #8  0x00007ffff2f3cb9a in VecGhostUpdateEnd ()
>    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #9  0x00007ffff74ecdea in dolfin::Assembler::assemble (this=0x7fffffff9da0, A=..., a=...)
>     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/fem/Assembler.cpp:96
> #10 0x00007ffff74e8095 in dolfin::assemble (A=..., a=...)
>     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/fem/assemble.cpp:38
> #11 0x0000000000425d41 in main ()
>     at /usr/users/blechta/fenics/fenics/src/dolfin/demo/pde/navier-stokes/cpp/main.cpp:180
> _________________________________________________________________________________________
> 
> 
> This is the backtrace from one of the deadlocked processes:
> ______________________________________________________________________
> #0  0x00007ffff5caa2d8 in __GI___poll (fds=0x6d02c0, nfds=6, 
>     timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
> #1  0x00007fffed0c5ab0 in ?? () from /usr/lib/libopen-pal.so.0
> #2  0x00007fffed0c48ff in ?? () from /usr/lib/libopen-pal.so.0
> #3  0x00007fffed0b9221 in opal_progress () from /usr/lib/libopen-pal.so.0
> #4  0x00007fffdb131a1d in ?? ()
>    from /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so
> #5  0x00007fffd9220db9 in ?? ()
>    from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
> #6  0x00007ffff1b6dee9 in PMPI_Allreduce () from /usr/lib/libmpi.so.0
> #7  0x00007ffff2e7aa74 in PetscSplitOwnership ()
>    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #8  0x00007ffff2eee129 in PetscLayoutSetUp ()
>    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #9  0x00007ffff2f31cf7 in VecCreate_MPI_Private ()
>    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #10 0x00007ffff2f32092 in VecCreate_MPI ()
>    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #11 0x00007ffff2f234f7 in VecSetType ()
>    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #12 0x00007ffff2f32708 in VecCreate_Standard ()
>    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #13 0x00007ffff2f234f7 in VecSetType ()
>    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #14 0x00007ffff2fb75a1 in MatGetVecs ()
>    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #15 0x00007ffff335fdc6 in PCSetUp_HYPRE ()
>    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #16 0x00007ffff3362cd6 in PCSetUp ()
>    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #17 0x00007ffff33f676e in KSPSetUp ()
>    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #18 0x00007ffff33f7bfe in KSPSolve ()
>    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #19 0x00007ffff77082f4 in dolfin::PETScKrylovSolver::solve (this=0x9700f0, x=..., b=...)
>     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/la/PETScKrylovSolver.cpp:445
> #20 0x00007ffff7709228 in dolfin::PETScKrylovSolver::solve (this=0x9700f0, A=..., x=..., b=...)
>     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/la/PETScKrylovSolver.cpp:491
> #21 0x00007ffff76d9303 in dolfin::KrylovSolver::solve (this=0x94a8e0, A=..., x=..., b=...)
>     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/la/KrylovSolver.cpp:147
> #22 0x00007ffff76f4b91 in dolfin::LinearSolver::solve (this=0x7fffffff9c40, A=..., x=..., b=...)
> _____________________________________________________________________________________
> 
> 
> On Wed, 29 May 2013 11:19:53 -0500
> Jed Brown <[email protected]> wrote:
> > Jan Blechta <[email protected]> writes:
> > 
> > > Maybe this is the PETSc stack from a previous time step - it is
> > > provided by DOLFIN.
> > >
> > >> Maybe you aren't checking error codes and then try to do
> > >> something else collective?
> > >
> > > I don't know, I'm just using FEniCS.
> > 
> > When I said "you", I was addressing the list in general, which
> > includes FEniCS developers.
> > 
> > >> > [2]PETSC ERROR: PCDestroy() line 121 in /petsc-3.4.0/src/ksp/pc/interface/precon.c
> > >> > [2]PETSC ERROR: KSPDestroy() line 788 in /petsc-3.4.0/src/ksp/ksp/interface/itfunc.c
> > >> >
> > >> > and deadlocks. Have you seen it before? Where could the
> > >> > problem be?
> > >> 
> > >> Deadlock must be back in your code.  This error occurs on
> > >> PETSC_COMM_SELF, which means we have no way to ensure that the
> > >> error condition is collective.  You can't just go calling other
> > >> collective functions after such an error.
> > >
> > > This means that DOLFIN handles some error condition poorly.
> > 
> > It appears that way, but that appears to be independent of whatever
> > causes Hypre to return an error.
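> > 
> > For concreteness, a minimal sketch of what checking the return code
> > could look like when driving PETSc from C (illustrative only, not
> > DOLFIN's actual code):
> > 
> >   #include <petscksp.h>
> > 
> >   /* Check the solver's return code and propagate it instead of
> >      continuing into further collective calls. */
> >   PetscErrorCode solve_checked(KSP ksp, Vec b, Vec x)
> >   {
> >     PetscErrorCode ierr;
> >     ierr = KSPSolve(ksp, b, x);
> >     /* The error may be local to this rank, so proceeding to other
> >        collective functions (e.g. KSPDestroy) can deadlock the ranks
> >        that did not fail; CHKERRQ returns the error up the stack. */
> >     CHKERRQ(ierr);
> >     return 0;
> >   }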
> > 
> > >> Anyway, please set up a reproducible test case and/or get a trace
> > >> from inside Hypre.  It will be useful for them to debug the
> > >> problem.
> > >
> > > I'm not a PETSc user, so it would be quite time-consuming for me
> > > to reproduce it without FEniCS. I will at least try to get a
> > > trace.
> > 
> > You can try dumping the matrix using '-ksp_view_mat binary' (writes
> > 'binaryoutput'), for example, then try solving it using a PETSc
> > example, e.g. src/ksp/ksp/examples/tutorials/ex10.c with the same
> > configuration via run-time options.
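> > 
> > Something along these lines should work (exact options from memory,
> > so check ex10's header comments):
> > 
> >   mpirun -n 3 ./ex10 -f0 binaryoutput -ksp_type gmres -pc_type hypre -pc_hypre_type boomeramg
> > 
> > ex10 loads the matrix (and the RHS, if one is present in the file)
> > with MatLoad/VecLoad, so the same 'binaryoutput' file can be used
> > directly.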
> 
