I just made a change in the mpool base memory hook callback; please see:

    https://svn.open-mpi.org/trac/ompi/changeset/20984

In short, I turned the error that Lenny discovered in https://svn.open-mpi.org/trac/ompi/ticket/1875 (which turned out to be an ob1 issue, not a memory hooks issue) into a fatal error rather than just calling opal_output(). So if this error ever happens again, it'll definitely show up in MTT via a bunch of failed tests (rather than someone happening to notice some opal_output() messages in the middle of a run).

I made the error fatal by calling _exit(), though -- quite ungraceful. The problem is that this is a void-returning callback in the middle of the memory allocator; there's no way to pass an error up higher for better handling. Other options include:

1. We could set a global variable, but then we'd have to notice that global error at some point later -- there's no real guarantee when exactly that would happen.
2. We could set a zero-time event to fire that would do some better cleanup/error handling, but that might (will?) call malloc() (remember: we're in a callback from the memory allocator, so calling malloc() may do Bad Things).
3. ...?
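
For reference, option 1 might look roughly like the sketch below. All of the names (hook_error_flag, release_hook, poll_hook_error) are made up for illustration and do not come from the changeset. It also assumes a single-threaded caller; a threaded build would presumably need real atomics instead of sig_atomic_t.

```c
#include <signal.h>

/* Hypothetical global: set inside the allocator callback, read later. */
static volatile sig_atomic_t hook_error_flag = 0;

/* Hypothetical void-returning callback: it cannot return an error,
 * so it only records that something went wrong. No malloc() here. */
void release_hook(int failed)
{
    if (failed) {
        hook_error_flag = 1;
    }
}

/* Hypothetical check that some later code path would have to call.
 * Returns 1 (and clears the flag) if the hook recorded an error. */
int poll_hook_error(void)
{
    if (hook_error_flag) {
        hook_error_flag = 0;
        return 1;
    }
    return 0;
}
```

The drawback is visible in the shape of the code: nothing happens until something calls poll_hook_error(), and there's no guarantee about when that is.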

However, I think that if this situation arises, we're in a bad place anyway -- perhaps the most sane thing to do is just exit immediately. "Better" error handling might have us exit a bit more cleanly (e.g., do some cleanup first), but _exit() will get the job done. And then ORTE takes over and kills the rest of the job.

*** Note that the old code was calling opal_output() to print the message, which might (will?) call malloc() anyway, so Bad Things could well have happened. Meaning that the message may not have actually gotten printed out -- yoinks. So the "print the message" code had to be updated anyway. I think the only controversial point in this commit is that I called _exit().
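
A malloc-free "print the message, then die" path could look something like the sketch below. This is not the code from the changeset. The function names are hypothetical, and it assumes only POSIX write() and _exit(), building the message by hand in a stack buffer so that nothing goes through stdio or the allocator.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: append src at offset pos, truncating if needed. */
static size_t append_str(char *buf, size_t cap, size_t pos, const char *src)
{
    if (cap == 0) {
        return pos;
    }
    while (*src != '\0' && pos + 1 < cap) {
        buf[pos++] = *src++;
    }
    buf[pos] = '\0';
    return pos;
}

/* Hypothetical helper: append a decimal int without touching stdio. */
static size_t append_int(char *buf, size_t cap, size_t pos, int value)
{
    char digits[12]; /* sized for a 32-bit int: 10 digits plus sign */
    size_t n = 0;
    unsigned int u = (value < 0) ? 0u - (unsigned int) value
                                 : (unsigned int) value;
    if (cap == 0) {
        return pos;
    }
    do {
        digits[n++] = (char) ('0' + u % 10u);
        u /= 10u;
    } while (u != 0u);
    if (value < 0) {
        digits[n++] = '-';
    }
    while (n > 0 && pos + 1 < cap) {
        buf[pos++] = digits[--n];
    }
    buf[pos] = '\0';
    return pos;
}

/* Build "fatal: <what> (rc=<rc>)\n" in a caller-supplied buffer. */
size_t format_fatal(char *buf, size_t cap, const char *what, int rc)
{
    size_t pos = 0;
    pos = append_str(buf, cap, pos, "fatal: ");
    pos = append_str(buf, cap, pos, what);
    pos = append_str(buf, cap, pos, " (rc=");
    pos = append_int(buf, cap, pos, rc);
    pos = append_str(buf, cap, pos, ")\n");
    return pos;
}

/* Hypothetical fatal path for a void callback inside the allocator:
 * stack buffer only, one write() to stderr, then _exit(). */
void fatal_from_alloc_hook(const char *what, int rc)
{
    char buf[256];
    size_t len = format_fatal(buf, sizeof(buf), what, rc);
    ssize_t ignored = write(STDERR_FILENO, buf, len);
    (void) ignored; /* nothing useful to do if the write fails */
    _exit(1);
}
```

Long messages are silently truncated to fit the buffer, which seems acceptable given that the process is about to go away.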

Comments?  Or is calling _exit() the least sucky thing to do here?

--
Jeff Squyres
Cisco Systems
