[OMPI devel] mpool errors fatal

Jeff Squyres Mon, 13 Apr 2009 21:05:56 -0400

I just made a change in the mpool base memory hook callback; please see:


    https://svn.open-mpi.org/trac/ompi/changeset/20984

In short, I made the error that Lenny discovered (which turned out to be an ob1 issue, not a memory hooks issue) in https://svn.open-mpi.org/trac/ompi/ticket/1875 be a fatal error rather than just calling opal_output(). So if this error ever happens again, it'll definitely show up in MTT via a bunch of failed tests (rather than someone happening to notice some opal_output's in the middle of a run).

I made the error fatal by calling _exit(), though -- quite ungraceful. The problem is that this is a void-returning callback in the middle of the memory allocator; there's no way to pass an error up higher for better handling. Other options include:

1. We could set a global variable, but then we'd have to notice that global error at some point later -- there's no real guarantee when exactly that would happen.

2. We could set a zero-time event to fire that would do some better cleanup/error handling, but that might (will?) call malloc() (remember: we're in a callback from the memory allocator, so calling malloc() may do Bad Things).

3. ...?
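
For concreteness, option 1 might look roughly like the sketch below. All of the names here (mpool_fatal_error, release_hook_record_error, progress_check_fatal) are made up for illustration and are not the actual Open MPI symbols; the point is only that the callback does nothing but set a flag, and some later code path has to remember to look at it.

```c
#include <signal.h>

/* Hypothetical global flag; not an actual Open MPI symbol. */
static volatile sig_atomic_t mpool_fatal_error = 0;

/* Hypothetical stand-in for the void-returning memory hook callback.
 * It cannot return an error and must not call malloc(), so it only
 * records that something went wrong. */
static void release_hook_record_error(void)
{
    mpool_fatal_error = 1;
}

/* Hypothetical check that some later code path (e.g., a progress
 * loop) would have to call. Returns 0 if all is well, -1 if the
 * callback recorded an error. */
static int progress_check_fatal(void)
{
    return mpool_fatal_error ? -1 : 0;
}
```

The weakness is visible in the sketch: nothing forces progress_check_fatal() to be called soon (or ever) after the flag is set.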

However, I think that if this situation arises, we're in a bad place anyway -- perhaps the most sane thing to do is just exit cleanly. "Better" error handling might have us exit a bit more cleanly (e.g., do some cleanup first), but _exit() will get the job done. And then ORTE takes over and kills the rest of the job.

*** Note that the old code was calling opal_output() to print the message, which might (will?) call malloc() anyway, so Bad Things could well have happened. Meaning that the message may not have actually gotten printed out -- yoinks. So the "print the message" code had to be updated anyway. I think the only controversial point in this commit is that I called _exit().
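
As a rough illustration of the "print without malloc(), then _exit()" idea (not the code from the changeset; the message text and the function names emit_fatal_message and release_hook_failed are assumptions made up for this sketch), something along these lines avoids the allocator entirely by writing a fixed string with write(2):

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical message text; the wording in the real commit may differ. */
static const char fatal_msg[] =
    "mpool: fatal error in memory hook callback; aborting\n";

/* Write the fixed message to fd using only write(2), which does not
 * allocate. Returns the number of bytes written, or -1 on error. */
static ssize_t emit_fatal_message(int fd)
{
    size_t len = sizeof(fatal_msg) - 1;   /* exclude the trailing NUL */
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, fatal_msg + done, len - done);
        if (n < 0) {
            if (errno == EINTR) {
                continue;                 /* interrupted; retry */
            }
            return -1;
        }
        done += (size_t) n;
    }
    return (ssize_t) done;
}

/* Hypothetical stand-in for the void-returning callback: there is
 * nowhere to return an error to, so print and leave immediately.
 * _exit() skips atexit handlers and stdio flushing. */
static void release_hook_failed(void)
{
    (void) emit_fatal_message(STDERR_FILENO);
    _exit(1);
}
```

The trade-off is the one described above: no cleanup runs at all, but nothing in this path can re-enter the allocator.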


Comments?  Or is calling _exit() the least sucky thing to do here?

--
Jeff Squyres
Cisco Systems
