I just made a change in the mpool base memory hook callback; please see:
https://svn.open-mpi.org/trac/ompi/changeset/20984
In short, I changed the error that Lenny discovered in
https://svn.open-mpi.org/trac/ompi/ticket/1875 (which turned out to
be an ob1 issue, not a memory hooks issue) to be a fatal error rather
than just a call to opal_output(). So if this
error ever happens again, it'll definitely show up in MTT via a bunch
of failed tests (rather than someone happening to notice some
opal_output() messages in the middle of a run).
I made the error fatal by calling _exit(), though -- quite
ungraceful. The problem is that this is a void-returning callback in
the middle of the memory allocator; there's no way to pass an error up
higher for better handling. Other options include:
1. We could set a global variable, but then we'd have to notice that
global error at some point later -- there's no real guarantee when
exactly that would happen.
2. We could set a zero-time event to fire that would do some better
cleanup/error handling, but that might (will?) call malloc()
(remember: we're in a callback from the memory allocator, so calling
malloc() may do Bad Things).
3. ...?
However, I think that if this situation arises, we're in a bad place
anyway -- perhaps the sanest thing to do is just exit immediately.
"Better" error handling might have us exit a bit more cleanly (e.g.,
do some cleanup first), but _exit() will get the job done. And then
ORTE takes over and kills the rest of the job.
*** Note that the old code was calling opal_output() to print the
message, which might (will?) call malloc() anyway, so Bad Things could
well have happened. Meaning that the message may not have actually
gotten printed out -- yoinks. So the "print the message" code had to
be updated anyway. I think the only controversial point in this
commit is that I called _exit().
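
To illustrate the idea (this is a sketch, not the code from the
changeset): one way to print and die without touching malloc() is to
stick to write(2) and _exit(2), both of which POSIX lists as
async-signal-safe. The function names below are made up for the
example.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: push a fixed string to fd using only write(2),
 * so nothing here can re-enter the memory allocator.
 * Returns the number of bytes written, or -1 on error. */
ssize_t example_write_all(int fd, const char *msg)
{
    size_t len = strlen(msg);
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, msg + done, len - done);
        if (n < 0) {
            if (errno == EINTR) {
                continue;       /* interrupted: just retry */
            }
            return -1;
        }
        if (n == 0) {
            return -1;          /* no progress; give up */
        }
        done += (size_t) n;
    }
    return (ssize_t) done;
}

/* Hypothetical fatal path for a void-returning callback: print a
 * static message, then leave without running atexit() handlers or
 * flushing stdio (either of which might allocate). */
void example_fatal(const char *msg)
{
    (void) example_write_all(STDERR_FILENO, msg);
    _exit(1);
}
```

The message has to be a preformatted string; anything printf()-like
may allocate behind our backs.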
Comments? Or is calling _exit() the least sucky thing to do here?
--
Jeff Squyres
Cisco Systems