On 30 May 2013 17:18, "Jan Blechta" <[email protected]> wrote:
>
> Johannes, now I see that only the buildbots with the old OpenMPI 1.4.3, as
> I have, do not run the regression tests in parallel. Is that the reason? Was
>   mpirun -n 3 demo_navier-stokes
> failing with the PETSc / Hypre BoomerAMG error and deadlocking?

Yes, we have disabled parallel testing on some of the buildbots because of
this.

Johannes

> Jan
>
>
> On Thu, 30 May 2013 15:47:12 +0200
> Jan Blechta <[email protected]> wrote:
> > I observed that BoomerAMG eventually fails when running on 3, 5, 6 or 7
> > processes. When using 1, 2, 4 or 8 processes, it is ok. Strangely enough,
> > nobody has seen it but me, even though I can reproduce it very easily with
> >   $ np=3   # or 5, 6, 7
> >   $ export DOLFIN_NOPLOT=1
> >   $ mpirun -n $np demo_navier-stokes
> > both with FEniCS 1.0.0 / PETSc 3.2 and with FEniCS dev / PETSc 3.4. After
> > a few timesteps PETSc fails and DOLFIN deadlocks.
> >
> > In this demo PETSc throws the error when solving the projection step,
> > i.e. a Poisson problem with both Dirichlet and zero Neumann conditions,
> > discretized by piecewise linears on triangles.
> >
> > Regarding the effort to reproduce it with PETSc directly, Jed, I was able
> > to dump this specific matrix to binary format but not the vector, so I
> > need to obtain the vector in binary form somehow - is there documentation
> > of that binary format somewhere?
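> >
> > (For reference, a minimal sketch of how that could be done directly
> > through the PETSc API, assuming the binary format in question is what
> > VecView() writes with a binary viewer; the helper and file names here
> > are illustrative, not anything in DOLFIN:)
> >
> >   #include <petscvec.h>
> >
> >   /* Write a Vec to PETSc's native binary format so it can later be
> >      read back with VecLoad(), e.g. by a standalone reproducer. */
> >   PetscErrorCode dump_vec_binary(Vec b, const char *filename)
> >   {
> >     PetscViewer    viewer;
> >     PetscErrorCode ierr;
> >     ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, filename,
> >                                  FILE_MODE_WRITE, &viewer); CHKERRQ(ierr);
> >     ierr = VecView(b, viewer); CHKERRQ(ierr);
> >     ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);
> >     return 0;
> >   }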
> >
> > I guess I would need to recompile PETSc in debug mode to break into
> > Hypre - is that so? This is the backtrace from the process printing the
> > PETSc ERROR:
> >
> > __________________________________________________________________________
> > #0  0x00007ffff5caa2d8 in __GI___poll (fds=0x6d02c0, nfds=6,
> >     timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
> > #1  0x00007fffed0c5ab0 in ?? () from /usr/lib/libopen-pal.so.0
> > #2  0x00007fffed0c48ff in ?? () from /usr/lib/libopen-pal.so.0
> > #3  0x00007fffed0b9221 in opal_progress () from /usr/lib/libopen-pal.so.0
> > #4  0x00007ffff1b593d5 in ?? () from /usr/lib/libmpi.so.0
> > #5  0x00007ffff1b8a1c5 in PMPI_Waitany () from /usr/lib/libmpi.so.0
> > #6  0x00007ffff2f5c43e in VecScatterEnd_1 ()
> >    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #7  0x00007ffff2f57811 in VecScatterEnd ()
> >    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #8  0x00007ffff2f3cb9a in VecGhostUpdateEnd ()
> >    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #9  0x00007ffff74ecdea in dolfin::Assembler::assemble (this=0x7fffffff9da0,
> >     A=..., a=...)
> >     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/fem/Assembler.cpp:96
> > #10 0x00007ffff74e8095 in dolfin::assemble (A=..., a=...)
> >     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/fem/assemble.cpp:38
> > #11 0x0000000000425d41 in main ()
> >     at /usr/users/blechta/fenics/fenics/src/dolfin/demo/pde/navier-stokes/cpp/main.cpp:180
> > __________________________________________________________________________
> >
> >
> > This is the backtrace from one deadlocked process:
> > ______________________________________________________________________
> > #0  0x00007ffff5caa2d8 in __GI___poll (fds=0x6d02c0, nfds=6,
> >     timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
> > #1  0x00007fffed0c5ab0 in ?? () from /usr/lib/libopen-pal.so.0
> > #2  0x00007fffed0c48ff in ?? () from /usr/lib/libopen-pal.so.0
> > #3  0x00007fffed0b9221 in opal_progress () from /usr/lib/libopen-pal.so.0
> > #4  0x00007fffdb131a1d in ?? ()
> >    from /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so
> > #5  0x00007fffd9220db9 in ?? ()
> >    from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
> > #6  0x00007ffff1b6dee9 in PMPI_Allreduce () from /usr/lib/libmpi.so.0
> > #7  0x00007ffff2e7aa74 in PetscSplitOwnership ()
> >    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #8  0x00007ffff2eee129 in PetscLayoutSetUp ()
> >    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #9  0x00007ffff2f31cf7 in VecCreate_MPI_Private ()
> >    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #10 0x00007ffff2f32092 in VecCreate_MPI ()
> >    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #11 0x00007ffff2f234f7 in VecSetType ()
> >    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #12 0x00007ffff2f32708 in VecCreate_Standard ()
> >    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #13 0x00007ffff2f234f7 in VecSetType ()
> >    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #14 0x00007ffff2fb75a1 in MatGetVecs ()
> >    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #15 0x00007ffff335fdc6 in PCSetUp_HYPRE ()
> >    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #16 0x00007ffff3362cd6 in PCSetUp ()
> >    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #17 0x00007ffff33f676e in KSPSetUp ()
> >    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #18 0x00007ffff33f7bfe in KSPSolve ()
> >    from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> > #19 0x00007ffff77082f4 in dolfin::PETScKrylovSolver::solve (this=0x9700f0,
> >     x=..., b=...)
> >     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/la/PETScKrylovSolver.cpp:445
> > #20 0x00007ffff7709228 in dolfin::PETScKrylovSolver::solve (this=0x9700f0,
> >     A=..., x=..., b=...)
> >     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/la/PETScKrylovSolver.cpp:491
> > #21 0x00007ffff76d9303 in dolfin::KrylovSolver::solve (this=0x94a8e0,
> >     A=..., x=..., b=...)
> >     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/la/KrylovSolver.cpp:147
> > #22 0x00007ffff76f4b91 in dolfin::LinearSolver::solve (this=0x7fffffff9c40,
> >     A=..., x=..., b=...)
> > ______________________________________________________________________
> >
> >
> > On Wed, 29 May 2013 11:19:53 -0500
> > Jed Brown <[email protected]> wrote:
> > > Jan Blechta <[email protected]> writes:
> > >
> > > > Maybe this is the PETSc stack from the previous time step - this is
> > > > provided by DOLFIN.
> > > >
> > > >> Maybe you aren't checking error codes and try to do something
> > > >> else collective?
> > > >
> > > > I don't know, I'm just using FEniCS.
> > >
> > > When I said "you", I was addressing the list in general, which
> > > includes FEniCS developers.
> > >
> > > >> > [2]PETSC ERROR: PCDestroy() line 121
> > > >> > in /petsc-3.4.0/src/ksp/pc/interface/precon.c
> > > >> > [2]PETSC ERROR: KSPDestroy() line 788
> > > >> > in /petsc-3.4.0/src/ksp/ksp/interface/itfunc.c
> > > >> >
> > > >> > and deadlocks. Have you seen it before? Where can the problem be?
> > > >>
> > > >> The deadlock must be back in your code.  This error occurs on
> > > >> PETSC_COMM_SELF, which means we have no way to ensure that the
> > > >> error condition is collective.  You can't just go calling other
> > > >> collective functions after such an error.
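> > > >>
> > > >> A minimal sketch of that pattern (the helper name is illustrative,
> > > >> not DOLFIN's actual code): check the PetscErrorCode of the call
> > > >> that can fail on a single rank and propagate it, rather than
> > > >> carrying on with further collective calls.
> > > >>
> > > >>   #include <petscksp.h>
> > > >>
> > > >>   /* Propagate a per-rank failure up the call chain instead of
> > > >>      continuing with collective PETSc calls after it. */
> > > >>   PetscErrorCode solve_step(KSP ksp, Vec b, Vec x)
> > > >>   {
> > > >>     PetscErrorCode ierr = KSPSolve(ksp, b, x);
> > > >>     CHKERRQ(ierr);   /* returns early if KSPSolve failed */
> > > >>     return 0;
> > > >>   }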
> > > >
> > > > This means that DOLFIN handles some error condition poorly.
> > >
> > > It appears that way, but that appears to be independent of whatever
> > > causes Hypre to return an error.
> > >
> > > >> Anyway, please set up a reproducible test case and/or get a trace
> > > >> from inside Hypre.  It will be useful for them to debug the
> > > >> problem.
> > > >
> > > > I'm not a PETSc user, so it would be quite time-consuming for me to
> > > > try to reproduce it without FEniCS. I will at least try to get a
> > > > trace.
> > >
> > > You can try dumping the matrix using '-ksp_view_mat binary' (writes
> > > 'binaryoutput'), for example, then try solving it using a PETSc
> > > example, e.g. src/ksp/ksp/examples/tutorials/ex10.c with the same
> > > configuration via run-time options.
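> > >
> > > As a rough sketch of such a standalone reproducer (assuming PETSc 3.4
> > > and Hypre; the default dump file name 'binaryoutput' is as above, the
> > > placeholder right-hand side and everything else is illustrative):
> > >
> > >   #include <petscksp.h>
> > >
> > >   int main(int argc, char **argv)
> > >   {
> > >     Mat A; Vec b, x; KSP ksp; PC pc; PetscViewer viewer;
> > >     PetscInitialize(&argc, &argv, NULL, NULL);
> > >     /* Error checking omitted for brevity. */
> > >
> > >     /* Load the matrix dumped via -ksp_view_mat binary */
> > >     PetscViewerBinaryOpen(PETSC_COMM_WORLD, "binaryoutput",
> > >                           FILE_MODE_READ, &viewer);
> > >     MatCreate(PETSC_COMM_WORLD, &A);
> > >     MatLoad(A, viewer);
> > >     PetscViewerDestroy(&viewer);
> > >
> > >     /* Placeholder RHS of ones until the real vector can be dumped */
> > >     MatGetVecs(A, &x, &b);
> > >     VecSet(b, 1.0);
> > >
> > >     /* Hypre BoomerAMG preconditioning; pick the Krylov method that
> > >        matches the DOLFIN run via -ksp_type at run time */
> > >     KSPCreate(PETSC_COMM_WORLD, &ksp);
> > >     KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN); /* 3.4 signature */
> > >     KSPGetPC(ksp, &pc);
> > >     PCSetType(pc, PCHYPRE);
> > >     PCHYPRESetType(pc, "boomeramg");
> > >     KSPSetFromOptions(ksp);
> > >     KSPSolve(ksp, b, x);
> > >
> > >     KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&b); VecDestroy(&x);
> > >     PetscFinalize();
> > >     return 0;
> > >   }
> > >
> > > Running it with mpirun -n 3 on the dumped matrix should show whether
> > > the failure is reproducible outside DOLFIN.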
> >
>
_______________________________________________
fenics mailing list
[email protected]
http://fenicsproject.org/mailman/listinfo/fenics
