On Wed, 21 Aug 2013, John Peterson wrote:

So... if the libmesh_terminate_handler gets called, it looks like it
will eventually call std::terminate()

Sure, unless the user has already set_terminate() some replacement
handler.  That's what would have happened if the
libmesh_terminate_handler hadn't existed at all, too.

which if I understand correctly [0], calls std::abort().  

Right.

Now if libmesh_error() is called on only one processor, throws, and
is not caught, I believe that processor will call std::terminate(),
thereby std::aborting on 1 proc and possibly not killing the MPI job
nicely.

Roy, 5 minutes ago: "Actually what happens when you throw an exception
is that the stack unwinds, stack allocated objects like the
LibMeshInit object are destroyed, and so any cleanup in their
destructors, such as closing up MPI, will have already taken place by
the time we hit the terminate handler."

Roy, after checking the C++ standard: "Uh oh."

So it turns out that although RAII should work correctly for
exceptions which get caught, it's "implementation-defined" whether the
stack gets properly unwound for an uncaught exception.

If that's the case, it seems that some of the logic in the
LibMeshInit destructor could be moved to libmesh_terminate_handler()

It looks like (unless we want to force everyone to wrap their whole
LibMeshInit object lifetime in a try/catch(...)) this is the only way
not to risk MPI jobs hanging in certain configurations on certain
errors.
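To make the try/catch option concrete, here is a minimal sketch of why catching guarantees the unwind. InitGuard is a hypothetical stand-in for LibMeshInit, and run_simulation() and guarded_main() are made-up names for illustration; none of this is libMesh code.

```cpp
#include <stdexcept>

// Hypothetical stand-in for an RAII object such as LibMeshInit.
struct InitGuard {
  explicit InitGuard(bool& flag) : destructor_ran(flag) {}
  ~InitGuard() { destructor_ran = true; }  // real code would shut down MPI here
  bool& destructor_ran;
};

// Hypothetical application body that fails with an exception.
void run_simulation() { throw std::runtime_error("simulated failure"); }

// Wraps the guard's whole lifetime in try/catch(...).  Because the
// exception is caught, the stack is unwound and ~InitGuard() runs
// before the catch block is entered.  Returns 0 on success, 1 on error.
int guarded_main(bool& destructor_ran) {
  try {
    InitGuard init(destructor_ran);
    run_simulation();
  } catch (...) {
    return 1;  // destructor_ran is already true at this point
  }
  return 0;
}
```

Calling guarded_main() from main() and returning its result would give the "wrap everything" behavior; without the catch, whether ~InitGuard() runs at all is up to the implementation.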

and if LibMeshInit's destructor detects an uncaught_exception, it
could call std::terminate manually.

I don't know if manual std::terminate() is kosher, and it definitely
shouldn't be necessary; if there's an uncaught exception then we'll
get to the terminate handler automatically.

Anyway, there's existing code in that destructor testing for uncaught
exceptions and doing MPI_Abort instead of _Finalize accordingly; we're
just not *getting* to that code unless the stack is unwound.  It's the
"terminate gets called without a stack unwind" case that's probably
screwing you guys up.
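For reference, the destructor-side check looks roughly like the sketch below. ParallelGuard and shutdown_action() are hypothetical names, and the strings "abort" and "finalize" stand in for the MPI_Abort and MPI_Finalize calls so the example builds without MPI. The exception is caught here only so that the unwind is guaranteed to happen.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// True while an exception is propagating (i.e. during stack unwinding).
// std::uncaught_exceptions() is the C++17 replacement for the older
// std::uncaught_exception().
inline bool exception_in_flight() {
#if __cplusplus >= 201703L
  return std::uncaught_exceptions() > 0;
#else
  return std::uncaught_exception();
#endif
}

// Hypothetical RAII guard: records which shutdown path it would take.
struct ParallelGuard {
  explicit ParallelGuard(std::string& log) : action(log) {}
  ~ParallelGuard() {
    // Real code would call MPI_Abort or MPI_Finalize here.
    action = exception_in_flight() ? "abort" : "finalize";
  }
  std::string& action;
};

// Runs a guarded scope, optionally throwing inside it, and reports
// which path the destructor chose.
std::string shutdown_action(bool fail) {
  std::string action = "none";
  try {
    ParallelGuard guard(action);
    if (fail) throw std::runtime_error("simulated failure");
  } catch (const std::exception&) {
    // Swallowed only so the sketch can return normally.
  }
  return action;
}
```

The catch is what makes this reliable: if nothing catches the exception and the implementation skips the unwind, the destructor never runs and neither branch is taken.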

Sadly, for implementations where terminate gets called *after* a stack
unwind we have a different problem: the stack trace we print out in
our terminate handler will be empty.

Thoughts?

To fix the "no MPI Abort when the stack isn't unwound" case: Check
MPI_Initialized() in our terminate handler, call MPI_Abort() from
there if it's true?
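A rough sketch of that handler shape, with the MPI calls replaced by stand-ins so it compiles without an MPI installation. Everything in namespace sketch is hypothetical: is_parallel_initialized() stands in for a query via MPI_Initialized(), and abort_parallel_job() stands in for MPI_Abort().

```cpp
#include <cstdlib>
#include <exception>

namespace sketch {
  // Stand-in state; real code would ask MPI_Initialized() instead.
  bool parallel_initialized = false;
  // Counts calls so the cleanup path can be checked without aborting.
  int abort_calls = 0;

  bool is_parallel_initialized() { return parallel_initialized; }

  // Stand-in for MPI_Abort(); real code would not return from it.
  void abort_parallel_job(int /*error_code*/) { ++abort_calls; }

  // Cleanup portion of the handler, kept separate from std::abort()
  // so that it can be exercised on its own.
  void terminate_cleanup() {
    if (is_parallel_initialized())
      abort_parallel_job(1);
  }

  // Handler to install with std::set_terminate(sketch::terminate_handler).
  [[noreturn]] void terminate_handler() {
    terminate_cleanup();
    std::abort();
  }
}
```

The point of the split is that the handler no longer depends on any destructor having run, so it covers the case where terminate is reached without an unwind.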

To fix the "no stack trace when the stack is unwound" case: move the
print_trace back from our terminate handler to the libmesh_error()
macro?  But this isn't a perfect fix, since we would lose traces from
other thrown exceptions.  Perhaps we could somehow keep our terminate
handler printing traces in cases where the uncaught exception isn't
from one of our macros?
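One possible way to tell the two cases apart, sketched under the assumption that our macros throw a dedicated type. TracedError and needs_trace() are hypothetical names, not existing libMesh identifiers; the handler would call needs_trace(std::current_exception()), relying on the in-flight exception still being visible inside the terminate handler.

```cpp
#include <exception>
#include <stdexcept>

// Hypothetical exception type thrown by our own error macro, which
// would already have printed a trace at the throw site.
struct TracedError : std::runtime_error {
  explicit TracedError(const char* what) : std::runtime_error(what) {}
};

// Decides whether the terminate handler should still print a trace.
bool needs_trace(std::exception_ptr e) {
  if (!e)
    return true;   // terminate() reached with no exception in flight
  try {
    std::rethrow_exception(e);
  } catch (const TracedError&) {
    return false;  // our macro already printed the trace
  } catch (...) {
    return true;   // someone else's exception: print what we can
  }
}
```

This only decides who prints; on implementations that unwind before calling terminate, the trace printed from the handler for foreign exceptions would still be empty.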
---
Roy
_______________________________________________
Libmesh-devel mailing list
Libmesh-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libmesh-devel