Alright, I could use more sets of eyeballs to help me find the source of a
hanging job.  We have a user running MOOSE on our cluster here and the job
hangs after several steps.  It's an end-user code, so explaining every
single piece of the application would be rather long-winded.  There are a
couple of highlights, though, that I'll mention here.

1. This application goes through multiple mesh adaptivity cycles between
solves, i.e., we compute error vectors, mark the mesh, and refine and coarsen
multiple times without another solve in between.

2. We also forcefully change select values in the solution vector(s) at the
end of the timestep, before the next solve.

Derek and I have written a small Python script that attaches a debugger to
a hung job on a cluster and prints the number of processes in each "unique"
state (uniqueness is determined by the stack trace).  Here is the output:

Unique Stack Traces
************************************
Count: 32
#0  0x00002b2dea658696 in poll () from /lib64/libc.so.6
#1  0x00002b2de97a3ef0 in poll_dispatch ()
#2  0x00002b2de97a2c23 in opal_event_base_loop ()
#3  0x00002b2de9797901 in opal_progress ()
#4  0x00002b2de8f1a0c5 in ompi_request_default_wait_any ()
#5  0x00002b2de8f4775d in PMPI_Waitany ()
#6  0x0000000001387389 in VecScatterEnd_1 ()
#7  0x0000000001382cf4 in VecScatterEnd ()
#8  0x0000000000d8aaa8 in libMesh::PetscVector<double>::localize(libMesh::NumericVector<double>&) const ()
#9  0x0000000000ae831b in libMesh::DofMap::enforce_constraints_exactly(libMesh::System const&, libMesh::NumericVector<double>*, bool) const ()
#10 0x0000000000e336ed in libMesh::System::project_vector(libMesh::NumericVector<double> const&, libMesh::NumericVector<double>&) const ()
#11 0x0000000000e340dc in libMesh::System::project_vector(libMesh::NumericVector<double>&) const ()
#12 0x0000000000dfed42 in libMesh::System::restrict_vectors() ()
#13 0x0000000000dca50b in libMesh::EquationSystems::reinit() ()
#14 0x00000000008ca72d in FEProblem::meshChanged() ()
#15 0x00000000008b12d3 in FEProblem::adaptMesh() ()
#16 0x00000000006e24f5 in MeshSolutionModify::endStep() ()
#17 0x00000000009a0efb in Transient::execute() ()
#18 0x0000000000980f66 in MooseApp::runInputFile() ()
#19 0x00000000009868ce in MooseApp::parseCommandLine() ()
#20 0x000000000097fd95 in MooseApp::run() ()
#21 0x00000000006d93e5 in main ()

************************************
Count: 64
#0  0x00002adad987f696 in poll () from /lib64/libc.so.6
#1  0x00002adad89caef0 in poll_dispatch ()
#2  0x00002adad89c9c23 in opal_event_base_loop ()
#3  0x00002adad89be901 in opal_progress ()
#4  0x00002adad814122d in ompi_request_default_wait_all ()
#5  0x00002adad816e5ad in PMPI_Waitall ()
#6  0x0000000001387050 in VecScatterEnd_1 ()
#7  0x0000000001382cf4 in VecScatterEnd ()
#8  0x0000000000d8aaa8 in libMesh::PetscVector<double>::localize(libMesh::NumericVector<double>&) const ()
#9  0x0000000000ae831b in libMesh::DofMap::enforce_constraints_exactly(libMesh::System const&, libMesh::NumericVector<double>*, bool) const ()
#10 0x0000000000e336ed in libMesh::System::project_vector(libMesh::NumericVector<double> const&, libMesh::NumericVector<double>&) const ()
#11 0x0000000000e340dc in libMesh::System::project_vector(libMesh::NumericVector<double>&) const ()
#12 0x0000000000dfed42 in libMesh::System::restrict_vectors() ()
#13 0x0000000000dca50b in libMesh::EquationSystems::reinit() ()
#14 0x00000000008ca72d in FEProblem::meshChanged() ()
#15 0x00000000008b12d3 in FEProblem::adaptMesh() ()
#16 0x00000000006e24f5 in MeshSolutionModify::endStep() ()
#17 0x00000000009a0efb in Transient::execute() ()
#18 0x0000000000980f66 in MooseApp::runInputFile() ()
#19 0x00000000009868ce in MooseApp::parseCommandLine() ()
#20 0x000000000097fd95 in MooseApp::run() ()
#21 0x00000000006d93e5 in main ()

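Take a look at frame 5 in each of these stack traces:  32 processes are
sitting in PMPI_Waitany while the other 64 are in PMPI_Waitall, and the
frame 6 return addresses inside VecScatterEnd_1 differ as well.  Everything
from frame 7 up to main is identical.  This is all the way down inside of
PETSc but appears to be the source of the problem.  Does anyone have any
ideas of how we might "split" branches in libMesh or PETSc and end up in
this state?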
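
In case anyone wants to reproduce the grouping on their own hung job, here
is a minimal sketch of the idea.  This is not our actual script; it assumes
each rank's backtrace has already been captured as text (for example from
gdb's "bt"), and the function names signature, group_traces and
first_divergence are made up for illustration.  Addresses are stripped
before comparing, since shared libraries load at different addresses in
different processes.

```python
import re
from collections import Counter

# Matches a gdb frame line such as
#   "#5  0x00002b2de8f4775d in PMPI_Waitany ()"
# and captures everything after the optional "0x... in " prefix.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(.*)$")


def signature(trace_text):
    """Reduce one backtrace to a tuple of frame descriptions, innermost first.

    Addresses are dropped so that two ranks in the same call chain compare
    equal.  Continuation lines of a wrapped frame do not start with "#N"
    and are ignored in this sketch.
    """
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.group(2).strip())
    return tuple(frames)


def group_traces(traces):
    """Count how many backtraces share each address-free signature."""
    return Counter(signature(t) for t in traces)


def first_divergence(sig_a, sig_b):
    """Walk from main() inward and return the first frame pair that differs.

    Returns (depth_from_main, frame_a, frame_b), or None if no differing
    pair is found within the shared depth of the two traces.
    """
    pairs = zip(reversed(sig_a), reversed(sig_b))
    for depth, (frame_a, frame_b) in enumerate(pairs):
        if frame_a != frame_b:
            return depth, frame_a, frame_b
    return None


if __name__ == "__main__":
    # Heavily abbreviated traces, only for demonstration.
    rank_a = """#0  0x00002b2dea658696 in poll () from /lib64/libc.so.6
#1  0x00002b2de8f4775d in PMPI_Waitany ()
#2  0x0000000001387389 in VecScatterEnd_1 ()
#3  0x00000000006d93e5 in main ()"""
    rank_b = """#0  0x00002adad987f696 in poll () from /lib64/libc.so.6
#1  0x00002adad816e5ad in PMPI_Waitall ()
#2  0x0000000001387050 in VecScatterEnd_1 ()
#3  0x00000000006d93e5 in main ()"""

    for sig, count in group_traces([rank_a, rank_a, rank_b]).items():
        print("Count:", count)
        print("\n".join(sig))
        print()
    print(first_divergence(signature(rank_a), signature(rank_b)))
```

Walking from main() inward rather than from frame 0 outward makes the first
mismatch point at the place where the ranks took different paths, instead
of at whatever low-level wait each one happens to be blocked in.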

Thanks for any ideas you may have,
Cody