Johannes, I now see that only the buildbots with old OpenMPI 1.4.3, the same version I have, do not run the regression tests in parallel. Is that the reason? Was mpirun -n 3 demo_navier-stokes failing with the PETSc/Hypre BoomerAMG error and deadlocking?
Jan

On Thu, 30 May 2013 15:47:12 +0200
Jan Blechta <[email protected]> wrote:
> I observed that BoomerAMG eventually fails when running on 3, 5, 6 or 7
> processes. When using 1, 2, 4 or 8 processes, it is ok. Strangely enough,
> nobody has seen it but me, even though I can reproduce it very easily:
>   $ np=3   # or 5, 6, 7
>   $ export DOLFIN_NOPLOT=1
>   $ mpirun -n $np demo_navier-stokes
> with FEniCS 1.0.0, PETSc 3.2 and with FEniCS dev, PETSc 3.4. After a few
> timesteps PETSc fails and DOLFIN deadlocks.
>
> PETSc throws the error in this demo when solving the projection step, i.e.
> a Poisson problem with both Dirichlet and zero Neumann conditions,
> discretized with piecewise linears on triangles.
>
> Regarding the effort to reproduce it with PETSc directly, Jed, I was able
> to dump this specific matrix to binary format but not the vector, so I
> need to obtain the binary vector somehow - is that binary format
> documented anywhere?
>
> I guess I would need to recompile PETSc in some debug mode to break into
> Hypre, is that so? This is the backtrace from the process printing the
> PETSc ERROR:
> __________________________________________________________________________
> #0  0x00007ffff5caa2d8 in __GI___poll (fds=0x6d02c0, nfds=6,
>     timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
> #1  0x00007fffed0c5ab0 in ?? () from /usr/lib/libopen-pal.so.0
> #2  0x00007fffed0c48ff in ?? () from /usr/lib/libopen-pal.so.0
> #3  0x00007fffed0b9221 in opal_progress () from /usr/lib/libopen-pal.so.0
> #4  0x00007ffff1b593d5 in ?? () from /usr/lib/libmpi.so.0
> #5  0x00007ffff1b8a1c5 in PMPI_Waitany () from /usr/lib/libmpi.so.0
> #6  0x00007ffff2f5c43e in VecScatterEnd_1 ()
>     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #7  0x00007ffff2f57811 in VecScatterEnd ()
>     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #8  0x00007ffff2f3cb9a in VecGhostUpdateEnd ()
>     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #9  0x00007ffff74ecdea in dolfin::Assembler::assemble (this=0x7fffffff9da0,
>     A=..., a=...)
>     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/fem/Assembler.cpp:96
> #10 0x00007ffff74e8095 in dolfin::assemble (A=..., a=...)
>     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/fem/assemble.cpp:38
> #11 0x0000000000425d41 in main ()
>     at /usr/users/blechta/fenics/fenics/src/dolfin/demo/pde/navier-stokes/cpp/main.cpp:180
> _________________________________________________________________________________________
>
> This is the backtrace from one deadlocked process:
> ______________________________________________________________________
> #0  0x00007ffff5caa2d8 in __GI___poll (fds=0x6d02c0, nfds=6,
>     timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
> #1  0x00007fffed0c5ab0 in ?? () from /usr/lib/libopen-pal.so.0
> #2  0x00007fffed0c48ff in ?? () from /usr/lib/libopen-pal.so.0
> #3  0x00007fffed0b9221 in opal_progress () from /usr/lib/libopen-pal.so.0
> #4  0x00007fffdb131a1d in ?? () from /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so
> #5  0x00007fffd9220db9 in ?? () from /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so
> #6  0x00007ffff1b6dee9 in PMPI_Allreduce () from /usr/lib/libmpi.so.0
> #7  0x00007ffff2e7aa74 in PetscSplitOwnership ()
>     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #8  0x00007ffff2eee129 in PetscLayoutSetUp ()
>     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #9  0x00007ffff2f31cf7 in VecCreate_MPI_Private ()
>     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #10 0x00007ffff2f32092 in VecCreate_MPI ()
>     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #11 0x00007ffff2f234f7 in VecSetType ()
>     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #12 0x00007ffff2f32708 in VecCreate_Standard ()
>     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #13 0x00007ffff2f234f7 in VecSetType ()
>     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #14 0x00007ffff2fb75a1 in MatGetVecs ()
>     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #15 0x00007ffff335fdc6 in PCSetUp_HYPRE ()
>     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #16 0x00007ffff3362cd6 in PCSetUp ()
>     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #17 0x00007ffff33f676e in KSPSetUp ()
>     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #18 0x00007ffff33f7bfe in KSPSolve ()
>     from /usr/local/pkg/petsc/3.4.0/gnu/lib/libpetsc.so
> #19 0x00007ffff77082f4 in dolfin::PETScKrylovSolver::solve (this=0x9700f0,
>     x=..., b=...)
>     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/la/PETScKrylovSolver.cpp:445
> #20 0x00007ffff7709228 in dolfin::PETScKrylovSolver::solve (this=0x9700f0,
>     A=..., x=..., b=...)
>     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/la/PETScKrylovSolver.cpp:491
> #21 0x00007ffff76d9303 in dolfin::KrylovSolver::solve (this=0x94a8e0,
>     A=..., x=..., b=...)
>     at /usr/users/blechta/fenics/fenics/src/dolfin/dolfin/la/KrylovSolver.cpp:147
> #22 0x00007ffff76f4b91 in dolfin::LinearSolver::solve (this=0x7fffffff9c40,
>     A=..., x=..., b=...)
> _____________________________________________________________________________________
>
> On Wed, 29 May 2013 11:19:53 -0500
> Jed Brown <[email protected]> wrote:
> > Jan Blechta <[email protected]> writes:
> >
> > > Maybe this is the PETSc stack from the previous time step - this is
> > > provided by DOLFIN.
> > >
> > >> Maybe you aren't checking error codes and try to do something
> > >> else collective?
> > >
> > > I don't know, I'm just using FEniCS.
> >
> > When I said "you", I was addressing the list in general, which
> > includes FEniCS developers.
> >
> > >> > [2]PETSC ERROR: PCDestroy() line 121 in
> > >> >   /petsc-3.4.0/src/ksp/pc/interface/precon.c
> > >> > [2]PETSC ERROR: KSPDestroy() line 788 in
> > >> >   /petsc-3.4.0/src/ksp/ksp/interface/itfunc.c
> > >> >
> > >> > and deadlocks. Have you seen it before? Where could the problem be?
> > >>
> > >> Deadlock must be back in your code. This error occurs on
> > >> PETSC_COMM_SELF, which means we have no way to ensure that the
> > >> error condition is collective. You can't just go calling other
> > >> collective functions after such an error.
> > >
> > > This means that DOLFIN handles some error condition poorly.
> >
> > It appears that way, but that appears to be independent of whatever
> > causes Hypre to return an error.
> >
> > >> Anyway, please set up a reproducible test case and/or get a trace
> > >> from inside Hypre. It will be useful for them to debug the
> > >> problem.
> > >
> > > I'm not a PETSc user, so it would be quite time-consuming for me to
> > > try to reproduce it without FEniCS. I will at least try to get a
> > > trace.
> >
> > You can try dumping the matrix using '-ksp_view_mat binary' (writes
> > 'binaryoutput'), for example, then try solving it using a PETSc
> > example, e.g. src/ksp/ksp/examples/tutorials/ex10.c, with the same
> > configuration via run-time options.
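
Regarding the binary-format question in the quoted message above: a matrix and a vector can be written into one PETSc binary file from code and read back with MatLoad/VecLoad, which use the same format that PETSc's binary viewers write. Below is a minimal sketch, assuming a parallel Mat and Vec on PETSC_COMM_WORLD; the file name system.bin and the helper names are my own choices, not anything from DOLFIN or this thread.

#include <petscksp.h>

/* Sketch: write a Mat and a Vec into one PETSc binary file, then read them
 * back. Several objects can be stored in the same file; they are read back
 * in the order in which they were written. */
PetscErrorCode dump_system(Mat A, Vec b)
{
  PetscViewer    viewer;
  PetscErrorCode ierr;

  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "system.bin",
                               FILE_MODE_WRITE, &viewer); CHKERRQ(ierr);
  ierr = MatView(A, viewer); CHKERRQ(ierr);   /* matrix first          */
  ierr = VecView(b, viewer); CHKERRQ(ierr);   /* right-hand side next  */
  ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);
  return 0;
}

PetscErrorCode load_system(Mat *A, Vec *b)
{
  PetscViewer    viewer;
  PetscErrorCode ierr;

  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "system.bin",
                               FILE_MODE_READ, &viewer); CHKERRQ(ierr);
  ierr = MatCreate(PETSC_COMM_WORLD, A); CHKERRQ(ierr);
  ierr = MatSetType(*A, MATAIJ); CHKERRQ(ierr);
  ierr = MatLoad(*A, viewer); CHKERRQ(ierr);
  ierr = VecCreate(PETSC_COMM_WORLD, b); CHKERRQ(ierr);
  ierr = VecLoad(*b, viewer); CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);
  return 0;
}

With the system dumped this way (or just the matrix via '-ksp_view_mat binary', as Jed suggests), the solve could presumably be repeated outside DOLFIN, e.g. with ex10 and the same run-time options (-pc_type hypre -pc_hypre_type boomeramg); I have not verified the exact ex10 invocation, so check the example's header for the file option.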
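On the point about checking error codes before making further collective calls: at the PETSc level, the pattern Jed describes looks roughly like the sketch below. The function name and the use of PETSC_COMM_WORLD are illustrative only; this is not DOLFIN's actual code path.

#include <petscksp.h>

/* Sketch: propagate the error code from KSPSolve and check the convergence
 * reason before doing anything else collective with the solution. */
PetscErrorCode solve_checked(KSP ksp, Vec b, Vec x)
{
  PetscErrorCode     ierr;
  KSPConvergedReason reason;

  /* CHKERRQ returns immediately on a nonzero error code (for example if
   * Hypre fails inside PCSetUp), instead of continuing with further
   * collective calls after the error. */
  ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);

  /* A negative reason means the iteration diverged or the setup failed;
   * treat that as an error too, rather than silently using x. */
  ierr = KSPGetConvergedReason(ksp, &reason); CHKERRQ(ierr);
  if (reason < 0) {
    SETERRQ1(PETSC_COMM_WORLD, PETSC_ERR_NOT_CONVERGED,
             "KSP solve failed, converged reason %d", (int)reason);
  }
  return 0;
}

Whether DOLFIN's PETScKrylovSolver does something equivalent around PETScKrylovSolver.cpp:445 (the frame in the backtrace above) is exactly what would need checking here.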
