Are these ghosted vectors?

I can't imagine how it could happen, but if the ghost indices are not
symmetric where they should be, you could have processor m waiting on a
message from processor n that is never coming...
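
Something like this toy MPI program (nothing to do with libMesh or PETSc,
just the shape of the failure; run it on 2 processors) shows what I mean:
each rank posts receives for the ghosts it thinks it has and sends what it
thinks its neighbors need, and if those two lists disagree anywhere,
somebody sits in Waitall forever.

#include <mpi.h>
#include <cstdio>
#include <vector>

// Toy illustration only (mpiexec -n 2 ./a.out).
// Rank 0's "ghost" list says it receives from rank 1, but rank 1's send
// list doesn't include rank 0, so rank 0 blocks in MPI_Waitall forever:
// the same shape of hang as VecScatterEnd stuck in PMPI_Waitany/Waitall.
int main(int argc, char ** argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Ranks I expect ghost values from (receive side)
  std::vector<int> recv_from = (rank == 0) ? std::vector<int>{1}
                                           : std::vector<int>{};
  // Ranks I think need my values (send side).  Asymmetric: rank 1 has
  // no idea that rank 0 is waiting on it.
  std::vector<int> send_to;

  std::vector<double>      ghosts(recv_from.size());
  std::vector<MPI_Request> requests(recv_from.size());
  for (std::size_t i = 0; i < recv_from.size(); ++i)
    MPI_Irecv(&ghosts[i], 1, MPI_DOUBLE, recv_from[i], /*tag*/ 0,
              MPI_COMM_WORLD, &requests[i]);

  double my_value = rank;
  for (int dest : send_to)
    MPI_Send(&my_value, 1, MPI_DOUBLE, dest, /*tag*/ 0, MPI_COMM_WORLD);

  // Rank 0 never returns from this call; the matching send does not exist.
  MPI_Waitall(static_cast<int>(requests.size()),
              requests.data(), MPI_STATUSES_IGNORE);

  std::printf("rank %d finished\n", rank);
  MPI_Finalize();
  return 0;
}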

-Ben




On Jan 23, 2013, at 12:00 PM, "Cody Permann" <[email protected]> wrote:

> Alright, I could use a few more sets of eyeballs to help me find the source
> of a hanging job.  We have a user running MOOSE on our cluster here, and the
> job hangs after several steps.  It's an end-user code, so explaining every
> single piece of the application would be rather long-winded, but there are a
> couple of highlights I'll mention here (with a rough sketch after the list).
> 
> 1. This application goes through multiple mesh adaptivity cycles between
> solves, i.e., we compute error vectors, mark the mesh, and refine and
> coarsen multiple times without another solve in between.
> 
> 2. We also forcefully change select values in the solution vector(s) at the
> end of the timestep, before the next solve.
> 
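> To make those two points concrete, the end of every timestep does
> something roughly like the following (a simplified sketch with made-up
> names and structure, not the actual application code):
> 
> #include "libmesh/equation_systems.h"
> #include "libmesh/mesh_refinement.h"
> #include "libmesh/numeric_vector.h"
> #include "libmesh/system.h"
> #include <vector>
> 
> // Sketch only: a stand-in for what the application does at the end of a step.
> void endOfStep(libMesh::EquationSystems & es,
>                libMesh::MeshRefinement & mesh_refinement,
>                libMesh::System & sys,
>                unsigned int n_adapt_cycles,
>                const std::vector<libMesh::dof_id_type> & dofs_to_force,
>                const std::vector<libMesh::Number> & forced_values)
> {
>   // (1) several refine/coarsen passes back to back, no solve in between;
>   //     each reinit() projects/restricts vectors, i.e. the localize() in
>   //     the stack traces below
>   for (unsigned int cycle = 0; cycle < n_adapt_cycles; ++cycle)
>   {
>     // error estimation and element flagging happen here in the real code
>     mesh_refinement.refine_and_coarsen_elements();
>     es.reinit();
>   }
> 
>   // (2) forcefully override selected solution values before the next solve
>   libMesh::NumericVector<libMesh::Number> & solution = *sys.solution;
>   for (std::size_t i = 0; i < dofs_to_force.size(); ++i)
>     if (dofs_to_force[i] >= solution.first_local_index() &&
>         dofs_to_force[i] <  solution.last_local_index())
>       solution.set(dofs_to_force[i], forced_values[i]);
> 
>   solution.close();   // collective: every processor has to reach this
>   sys.update();       // refresh current_local_solution from the new values
> }
> 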
> Derek and I have written a small Python script that attaches a debugger to
> a hung job on a cluster and prints the number of processes in each "unique"
> state (uniqueness determined by the stack trace).  Here is the output:
> 
> Unique Stack Traces
> ************************************
> Count: 32
> #0  0x00002b2dea658696 in poll () from /lib64/libc.so.6
> #1  0x00002b2de97a3ef0 in poll_dispatch ()
> #2  0x00002b2de97a2c23 in opal_event_base_loop ()
> #3  0x00002b2de9797901 in opal_progress ()
> #4  0x00002b2de8f1a0c5 in ompi_request_default_wait_any ()
> #5  0x00002b2de8f4775d in PMPI_Waitany ()
> #6  0x0000000001387389 in VecScatterEnd_1 ()
> #7  0x0000000001382cf4 in VecScatterEnd ()
> #8  0x0000000000d8aaa8 in libMesh::PetscVector<double>::localize(libMesh::NumericVector<double>&) const ()
> #9  0x0000000000ae831b in libMesh::DofMap::enforce_constraints_exactly(libMesh::System const&, libMesh::NumericVector<double>*, bool) const ()
> #10 0x0000000000e336ed in libMesh::System::project_vector(libMesh::NumericVector<double> const&, libMesh::NumericVector<double>&) const ()
> #11 0x0000000000e340dc in libMesh::System::project_vector(libMesh::NumericVector<double>&) const ()
> #12 0x0000000000dfed42 in libMesh::System::restrict_vectors() ()
> #13 0x0000000000dca50b in libMesh::EquationSystems::reinit() ()
> #14 0x00000000008ca72d in FEProblem::meshChanged() ()
> #15 0x00000000008b12d3 in FEProblem::adaptMesh() ()
> #16 0x00000000006e24f5 in MeshSolutionModify::endStep() ()
> #17 0x00000000009a0efb in Transient::execute() ()
> #18 0x0000000000980f66 in MooseApp::runInputFile() ()
> #19 0x00000000009868ce in MooseApp::parseCommandLine() ()
> #20 0x000000000097fd95 in MooseApp::run() ()
> #21 0x00000000006d93e5 in main ()
> 
> ************************************
> Count: 64
> #0  0x00002adad987f696 in poll () from /lib64/libc.so.6
> #1  0x00002adad89caef0 in poll_dispatch ()
> #2  0x00002adad89c9c23 in opal_event_base_loop ()
> #3  0x00002adad89be901 in opal_progress ()
> #4  0x00002adad814122d in ompi_request_default_wait_all ()
> #5  0x00002adad816e5ad in PMPI_Waitall ()
> #6  0x0000000001387050 in VecScatterEnd_1 ()
> #7  0x0000000001382cf4 in VecScatterEnd ()
> #8  0x0000000000d8aaa8 in libMesh::PetscVector<double>::localize(libMesh::NumericVector<double>&) const ()
> #9  0x0000000000ae831b in libMesh::DofMap::enforce_constraints_exactly(libMesh::System const&, libMesh::NumericVector<double>*, bool) const ()
> #10 0x0000000000e336ed in libMesh::System::project_vector(libMesh::NumericVector<double> const&, libMesh::NumericVector<double>&) const ()
> #11 0x0000000000e340dc in libMesh::System::project_vector(libMesh::NumericVector<double>&) const ()
> #12 0x0000000000dfed42 in libMesh::System::restrict_vectors() ()
> #13 0x0000000000dca50b in libMesh::EquationSystems::reinit() ()
> #14 0x00000000008ca72d in FEProblem::meshChanged() ()
> #15 0x00000000008b12d3 in FEProblem::adaptMesh() ()
> #16 0x00000000006e24f5 in MeshSolutionModify::endStep() ()
> #17 0x00000000009a0efb in Transient::execute() ()
> #18 0x0000000000980f66 in MooseApp::runInputFile() ()
> #19 0x00000000009868ce in MooseApp::parseCommandLine() ()
> #20 0x000000000097fd95 in MooseApp::run() ()
> #21 0x00000000006d93e5 in main ()
> 
> Take a look at frame 5 in each of these stack traces: the 32 processes in
> the first group are stuck in PMPI_Waitany while the other 64 are stuck in
> PMPI_Waitall, both reached from PETSc's VecScatterEnd(), and that mismatch
> appears to be the source of the problem.  Does anyone have ideas about how
> we might "split" branches like this in libMesh or PETSc and end up in this
> state?
> 
> Thanks for any ideas you may have,
> Cody

_______________________________________________
Libmesh-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/libmesh-users
