On Wed, Jan 23, 2013 at 9:41 PM, Cody Permann <[email protected]> wrote:

> On Wed, Jan 23, 2013 at 11:05 AM, Kirk, Benjamin (JSC-EG311) <
> [email protected]> wrote:
>
> > Are these ghosted vectors?
> >
> > Can't imagine how it could happen, but if the ghost indices are not
> > symmetric where they should be, you could have processor m waiting on a
> > message from processor n that is not coming...
> >
>
> Yes, ghosted vectors.  Well, I guess that's somewhere to start looking.  I
> found the location of the branch down inside PETSc where the paths
> diverge (wait_all vs. wait_any), but I admit I have no idea what's happening
> at that level. I haven't been able to get the code to hang on my local
> workstation with 8-10 processor jobs, and sadly it runs for a long time
> before hanging on the cluster-sized jobs.
>
> Also, I haven't tried a full debug build yet because of the size of the
> problem, but I'll put that on the "to do" list too.  If we're lucky,
> perhaps we'll hit an assert if we ever get there.  I'll keep you posted.
>
It would be easier if we had line numbers, but it's virtually certain that
your Waitany is waiting on a recv while your Waitall is finalizing the sends
($PETSC_DIR/src/vec/vec/utils/vpscat.c:VecScatterEnd_).
Basically, the Waitany proc has an nrecvs count too high to be satisfied by
the senders, which suggests the problem is in VecScatterCreate() and
ultimately, most likely, in the arguments to VecCreateGhost().
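
If you want to rule out a plainly bad ghost list first, something like the
sketch below could be run with the same arguments you are about to pass to
VecCreateGhost().  Just a sketch, untested, and the helper name is made up;
it only catches out-of-range or locally owned ghost indices on each rank.  A
list that is valid per rank but inconsistent between ranks, the asymmetry Ben
mentioned, would still need a gather-and-compare step.

  /* Hypothetical helper, not existing PETSc or libMesh code: verify that
     every requested ghost index is globally valid and not locally owned,
     using the same arguments that will be handed to VecCreateGhost(). */
  #include <petscvec.h>

  static PetscErrorCode CheckGhostIndices(MPI_Comm comm, PetscInt n_local,
                                          PetscInt N_global, PetscInt n_ghost,
                                          const PetscInt ghosts[])
  {
    PetscInt first, last = 0, i;

    /* Reproduce PETSc's contiguous layout: this rank owns [first, last). */
    MPI_Scan(&n_local, &last, 1, MPIU_INT, MPI_SUM, comm);
    first = last - n_local;

    for (i = 0; i < n_ghost; ++i) {
      if (ghosts[i] < 0 || ghosts[i] >= N_global)
        SETERRQ(comm, PETSC_ERR_ARG_OUTOFRANGE, "ghost index out of global range");
      if (ghosts[i] >= first && ghosts[i] < last)
        SETERRQ(comm, PETSC_ERR_ARG_WRONG, "ghost index is locally owned");
    }
    return 0;
  }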

That petsc code hasn't changed substantially in quite a while, with the
exception of adding optional one-sided
stuff (and some CUDA-related code), so I doubt this problem would depend on
using a particular (relatively recent) version of petsc.

Is this an AMR run?  That's the only way I would imagine the number of sends
and receives changing midway through.

Dmitry.




> Cody
>
>
> >
> > -Ben
> >
> >
> >
> >
> > On Jan 23, 2013, at 12:00 PM, "Cody Permann" <[email protected]>
> > wrote:
> >
> > > Alright, I could use more sets of eyeballs to help me find the source
> > > of a hanging job.  We have a user running MOOSE on our cluster here and
> > > the job hangs after several steps.  It's an end-user code, so explaining
> > > every single piece of the application would be rather long-winded.
> > > There are a couple of highlights, though, that I'll mention here.
> > >
> > > 1. This application goes through multiple mesh adaptivity cycles between
> > > solves, i.e., we compute error vectors, mark the mesh, and refine and
> > > coarsen multiple times without another solve in between.
> > >
> > > 2. We also forcefully change select values in the solution vector(s) at
> > > the end of the timestep, before the next solve.
> > >
> > > Derek and I have written a small Python script which attaches a debugger
> > > to a hung job on a cluster and prints the number of processes in each
> > > "unique" state (unique determined by the stack trace).  Here is the
> > > output:
> > >
> > > Unique Stack Traces
> > > ************************************
> > > Count: 32
> > > #0  0x00002b2dea658696 in poll () from /lib64/libc.so.6
> > > #1  0x00002b2de97a3ef0 in poll_dispatch ()
> > > #2  0x00002b2de97a2c23 in opal_event_base_loop ()
> > > #3  0x00002b2de9797901 in opal_progress ()
> > > #4  0x00002b2de8f1a0c5 in ompi_request_default_wait_any ()
> > > #5  0x00002b2de8f4775d in PMPI_Waitany ()
> > > #6  0x0000000001387389 in VecScatterEnd_1 ()
> > > #7  0x0000000001382cf4 in VecScatterEnd ()
> > > #8  0x0000000000d8aaa8 in libMesh::PetscVector<double>::localize(libMesh::NumericVector<double>&) const ()
> > > #9  0x0000000000ae831b in libMesh::DofMap::enforce_constraints_exactly(libMesh::System const&, libMesh::NumericVector<double>*, bool) const ()
> > > #10 0x0000000000e336ed in libMesh::System::project_vector(libMesh::NumericVector<double> const&, libMesh::NumericVector<double>&) const ()
> > > #11 0x0000000000e340dc in libMesh::System::project_vector(libMesh::NumericVector<double>&) const ()
> > > #12 0x0000000000dfed42 in libMesh::System::restrict_vectors() ()
> > > #13 0x0000000000dca50b in libMesh::EquationSystems::reinit() ()
> > > #14 0x00000000008ca72d in FEProblem::meshChanged() ()
> > > #15 0x00000000008b12d3 in FEProblem::adaptMesh() ()
> > > #16 0x00000000006e24f5 in MeshSolutionModify::endStep() ()
> > > #17 0x00000000009a0efb in Transient::execute() ()
> > > #18 0x0000000000980f66 in MooseApp::runInputFile() ()
> > > #19 0x00000000009868ce in MooseApp::parseCommandLine() ()
> > > #20 0x000000000097fd95 in MooseApp::run() ()
> > > #21 0x00000000006d93e5 in main ()
> > >
> > > ************************************
> > > Count: 64
> > > #0  0x00002adad987f696 in poll () from /lib64/libc.so.6
> > > #1  0x00002adad89caef0 in poll_dispatch ()
> > > #2  0x00002adad89c9c23 in opal_event_base_loop ()
> > > #3  0x00002adad89be901 in opal_progress ()
> > > #4  0x00002adad814122d in ompi_request_default_wait_all ()
> > > #5  0x00002adad816e5ad in PMPI_Waitall ()
> > > #6  0x0000000001387050 in VecScatterEnd_1 ()
> > > #7  0x0000000001382cf4 in VecScatterEnd ()
> > > #8  0x0000000000d8aaa8 in libMesh::PetscVector<double>::localize(libMesh::NumericVector<double>&) const ()
> > > #9  0x0000000000ae831b in libMesh::DofMap::enforce_constraints_exactly(libMesh::System const&, libMesh::NumericVector<double>*, bool) const ()
> > > #10 0x0000000000e336ed in libMesh::System::project_vector(libMesh::NumericVector<double> const&, libMesh::NumericVector<double>&) const ()
> > > #11 0x0000000000e340dc in libMesh::System::project_vector(libMesh::NumericVector<double>&) const ()
> > > #12 0x0000000000dfed42 in libMesh::System::restrict_vectors() ()
> > > #13 0x0000000000dca50b in libMesh::EquationSystems::reinit() ()
> > > #14 0x00000000008ca72d in FEProblem::meshChanged() ()
> > > #15 0x00000000008b12d3 in FEProblem::adaptMesh() ()
> > > #16 0x00000000006e24f5 in MeshSolutionModify::endStep() ()
> > > #17 0x00000000009a0efb in Transient::execute() ()
> > > #18 0x0000000000980f66 in MooseApp::runInputFile() ()
> > > #19 0x00000000009868ce in MooseApp::parseCommandLine() ()
> > > #20 0x000000000097fd95 in MooseApp::run() ()
> > > #21 0x00000000006d93e5 in main ()
> > >
> > > Take a look at frame 5 in each of these stack traces: this is all the
> > > way down inside PETSc but appears to be the source of the problem.
> > > Does anyone have any ideas about how we might "split" branches in
> > > libMesh or PETSc and end up in this state?
> > >
> > > Thanks for any ideas you may have,
> > > Cody
> > >
> >