Are these ghosted vectors? I can't imagine how it could happen, but if the ghost indices are not symmetric where they should be, you could have processor m waiting on a message from processor n that is never coming...
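To illustrate what I mean, here is a contrived little sketch (mpi4py, nothing to do with libMesh/PETSc internals; the index maps are made up) of how an asymmetric ghost map turns into exactly that kind of hang:

# Contrived illustration of the asymmetric-ghosting failure mode, in mpi4py.
# Run with: mpiexec -n 2 python ghost_hang.py  -- rank 0 hangs in recv().
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# What each rank believes it needs from the other rank (its ghost indices) and
# what it believes it owes. A consistent ghosting setup would have
# needs[0] == owes[1] and needs[1] == owes[0]; here the maps are asymmetric.
needs = {0: [5, 6], 1: []}   # rank 0 wants dofs 5 and 6 owned by rank 1
owes  = {0: [],     1: []}   # but rank 1's side of the map says it owes nothing

other = 1 - rank
if needs[rank]:
    # Rank 0 posts a blocking receive for ghost values that are never sent.
    ghost_values = comm.recv(source=other, tag=7)
if owes[rank]:
    # Rank 1 skips this entirely, so the receive above never completes.
    comm.send([1.0] * len(owes[rank]), dest=other, tag=7)

# Rank 1 parks here while rank 0 sits in recv() forever.
comm.Barrier()

The blocking receive is roughly the analogue of what VecScatterEnd is waiting on, which is why an inconsistent ghosting pattern would show up as ranks parked in PMPI_Waitany/PMPI_Waitall.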
-Ben

On Jan 23, 2013, at 12:00 PM, "Cody Permann" <[email protected]> wrote:

> Alright, I could use more sets of eyeballs to help me find the source of a
> hanging job. We have a user running MOOSE on our cluster here and the job
> hangs after several steps. It's an end user code so explaining every
> single piece of the application would be rather long winded. There are a
> couple of highlights though that I'll mention here.
>
> 1. This application goes through multiple mesh adaptivity cycles between
> solves. i.e. We compute error vectors, mark the mesh, refine and coarsen
> multiple times without another solve in-between.
>
> 2. We also forcefully change select values in the solution vector(s) at the
> end of the timestep, before the next solve.
>
> Derek and I have written a small python script which attaches a debugger to
> a hung job on a cluster and prints the number of processes in each "unique"
> state (unique determined by the stack trace). Here is the output:
>
> Unique Stack Traces
> ************************************
> Count: 32
> #0  0x00002b2dea658696 in poll () from /lib64/libc.so.6
> #1  0x00002b2de97a3ef0 in poll_dispatch ()
> #2  0x00002b2de97a2c23 in opal_event_base_loop ()
> #3  0x00002b2de9797901 in opal_progress ()
> #4  0x00002b2de8f1a0c5 in ompi_request_default_wait_any ()
> #5  0x00002b2de8f4775d in PMPI_Waitany ()
> #6  0x0000000001387389 in VecScatterEnd_1 ()
> #7  0x0000000001382cf4 in VecScatterEnd ()
> #8  0x0000000000d8aaa8 in libMesh::PetscVector<double>::localize(libMesh::NumericVector<double>&) const ()
> #9  0x0000000000ae831b in libMesh::DofMap::enforce_constraints_exactly(libMesh::System const&, libMesh::NumericVector<double>*, bool) const ()
> #10 0x0000000000e336ed in libMesh::System::project_vector(libMesh::NumericVector<double> const&, libMesh::NumericVector<double>&) const ()
> #11 0x0000000000e340dc in libMesh::System::project_vector(libMesh::NumericVector<double>&) const ()
> #12 0x0000000000dfed42 in libMesh::System::restrict_vectors() ()
> #13 0x0000000000dca50b in libMesh::EquationSystems::reinit() ()
> #14 0x00000000008ca72d in FEProblem::meshChanged() ()
> #15 0x00000000008b12d3 in FEProblem::adaptMesh() ()
> #16 0x00000000006e24f5 in MeshSolutionModify::endStep() ()
> #17 0x00000000009a0efb in Transient::execute() ()
> #18 0x0000000000980f66 in MooseApp::runInputFile() ()
> #19 0x00000000009868ce in MooseApp::parseCommandLine() ()
> #20 0x000000000097fd95 in MooseApp::run() ()
> #21 0x00000000006d93e5 in main ()
>
> ************************************
> Count: 64
> #0  0x00002adad987f696 in poll () from /lib64/libc.so.6
> #1  0x00002adad89caef0 in poll_dispatch ()
> #2  0x00002adad89c9c23 in opal_event_base_loop ()
> #3  0x00002adad89be901 in opal_progress ()
> #4  0x00002adad814122d in ompi_request_default_wait_all ()
> #5  0x00002adad816e5ad in PMPI_Waitall ()
> #6  0x0000000001387050 in VecScatterEnd_1 ()
> #7  0x0000000001382cf4 in VecScatterEnd ()
> #8  0x0000000000d8aaa8 in libMesh::PetscVector<double>::localize(libMesh::NumericVector<double>&) const ()
> #9  0x0000000000ae831b in libMesh::DofMap::enforce_constraints_exactly(libMesh::System const&, libMesh::NumericVector<double>*, bool) const ()
> #10 0x0000000000e336ed in libMesh::System::project_vector(libMesh::NumericVector<double> const&, libMesh::NumericVector<double>&) const ()
> #11 0x0000000000e340dc in libMesh::System::project_vector(libMesh::NumericVector<double>&) const ()
> #12 0x0000000000dfed42 in libMesh::System::restrict_vectors() ()
> #13 0x0000000000dca50b in libMesh::EquationSystems::reinit() ()
> #14 0x00000000008ca72d in FEProblem::meshChanged() ()
> #15 0x00000000008b12d3 in FEProblem::adaptMesh() ()
> #16 0x00000000006e24f5 in MeshSolutionModify::endStep() ()
> #17 0x00000000009a0efb in Transient::execute() ()
> #18 0x0000000000980f66 in MooseApp::runInputFile() ()
> #19 0x00000000009868ce in MooseApp::parseCommandLine() ()
> #20 0x000000000097fd95 in MooseApp::run() ()
> #21 0x00000000006d93e5 in main ()
>
> Take a look at frame 5 in each of these stack traces: This is all the way
> down inside of PETSc but appears to be the source of the problem. Does
> anyone have any ideas of how we might "split" branches in libMesh or PETSc
> and end up in this state?
>
> Thanks for any ideas you may have,
> Cody
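For reference, the kind of trace-collection script Cody describes can be sketched in a few lines. This is only a guess at the approach, not the actual script Derek and Cody wrote; it assumes gdb is reachable on the node and that the PIDs of the hung ranks are passed on the command line.

#!/usr/bin/env python
# Hedged sketch of a "unique stack trace" collector for a hung MPI job.
# Usage (hypothetical): python unique_stacks.py <pid1> <pid2> ...
import re
import subprocess
import sys
from collections import Counter

def backtrace(pid):
    """Attach gdb in batch mode, dump a backtrace, and detach."""
    out = subprocess.run(
        ["gdb", "--batch", "-p", str(pid), "-ex", "bt"],
        capture_output=True, text=True,
    ).stdout
    frames = []
    for line in out.splitlines():
        if line.startswith("#"):
            # Strip frame numbers and addresses so traces compare by call path only.
            frames.append(re.sub(r"^#\d+\s+(0x[0-9a-f]+ in )?", "", line))
    return "\n".join(frames)

if __name__ == "__main__":
    counts = Counter(backtrace(int(pid)) for pid in sys.argv[1:])
    print("Unique Stack Traces")
    for trace, n in counts.most_common():
        print("*" * 36)
        print("Count: %d" % n)
        print(trace)

Grouping by the address-stripped trace is what produces a summary like the "Count: 32" / "Count: 64" output above.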