Hi,
I'm currently trying to have Open MPI exit more gracefully when a BTL
returns an error during the "add procs" phase.
The current bml/r2 code silently ignores btl->add_procs() error codes with
the following comment:
---- ompi/mca/bml/r2/bml_r2.c:208 ----
/* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
* can take care of this task. */
continue;
--------------------------------------
This seems wrong to me: either a proc is reachable (and the "reachable" bit
field is updated accordingly), or it is not (and nothing is done). Any
error code should denote a fatal error that requires a clean abort.
In the current openib BTL code, the "reachable" bit is set but an error is
returned, which r2 then ignores. The next call to the openib BTL results in
a segmentation fault.
So, maybe this simple fix would do the trick:
========================================================================
diff -r 96e0793d7885 ompi/mca/bml/r2/bml_r2.c
--- a/ompi/mca/bml/r2/bml_r2.c Wed May 19 14:35:27 2010 +0200
+++ b/ompi/mca/bml/r2/bml_r2.c Tue May 25 10:54:19 2010 +0200
@@ -210,7 +210,7 @@
/* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
* can take care of this task.
*/
- continue;
+ return rc;
}
/* for each proc that is reachable */
========================================================================
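To make the difference between the two behaviours concrete, here is a small
standalone toy model of the loop. It is only a sketch and not the real r2
code: toy_btl_t, toy_add_procs(), the TOY_* constants and the
abort_on_error flag are all names I made up for illustration, and the
"broken" BTL simply mimics what I described above (reachable bits set,
error returned).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SUCCESS   0
#define TOY_ERROR     (-1)
#define TOY_MAX_PROCS 8

/* Hypothetical stand-in for a BTL module (not the real Open MPI struct). */
typedef struct toy_btl {
    const char *name;
    int (*add_procs)(struct toy_btl *btl, size_t nprocs, bool *reachable);
} toy_btl_t;

/* Mimics the reported behaviour: sets the reachable bits, then fails. */
int toy_broken_add_procs(toy_btl_t *btl, size_t nprocs, bool *reachable)
{
    (void)btl;
    for (size_t i = 0; i < nprocs; i++) {
        reachable[i] = true;
    }
    return TOY_ERROR;
}

/* Well-behaved BTL: reaches every proc and reports success. */
int toy_good_add_procs(toy_btl_t *btl, size_t nprocs, bool *reachable)
{
    (void)btl;
    for (size_t i = 0; i < nprocs; i++) {
        reachable[i] = true;
    }
    return TOY_SUCCESS;
}

/*
 * Simplified version of the loop over BTLs.
 * abort_on_error == false models the current "continue";
 * abort_on_error == true  models the proposed "return rc".
 * *skipped_failures counts errors that were silently dropped.
 */
int toy_add_procs(toy_btl_t **btls, size_t nbtls, size_t nprocs,
                  bool abort_on_error, size_t *skipped_failures)
{
    assert(nprocs <= TOY_MAX_PROCS);
    *skipped_failures = 0;

    for (size_t b = 0; b < nbtls; b++) {
        bool reachable[TOY_MAX_PROCS] = { false };
        int rc = btls[b]->add_procs(btls[b], nprocs, reachable);

        if (rc != TOY_SUCCESS) {
            if (abort_on_error) {
                return rc;          /* proposed behaviour */
            }
            (*skipped_failures)++;  /* current behaviour: error is dropped */
            continue;
        }
        /* for each proc that is reachable: the real code would record
         * this BTL as an endpoint here; omitted in this sketch. */
    }
    return TOY_SUCCESS;
}
```

Calling toy_add_procs() on a list containing one broken and one good BTL
with abort_on_error set to false returns TOY_SUCCESS and leaves
*skipped_failures at 1, so the caller has no way to know that a BTL failed.
With abort_on_error set to true, the same call returns TOY_ERROR at the
first failing BTL and the caller can abort cleanly.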
Does anyone see a case (with a specific BTL) where add_procs returns an
error but we still want to continue?
Sylvain