Hi,

I'm currently trying to have Open MPI exit more gracefully when a BTL returns an error during the "add procs" phase.

The current bml/r2 code silently ignores the error codes returned by btl->add_procs(), with the following comment:
---- ompi/mca/bml/r2/bml_r2.c:208 ----
  /* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
   * can take care of this task. */
  continue;
--------------------------------------

This seems wrong to me: either a proc is reachable (and the "reachable" bit field is updated accordingly), or it is not (and nothing is done). Any error code should denote a fatal error that requires a clean abort.

In the current openib BTL code, the "reachable" bit is set but an error is returned, which r2 then ignores. The next call to the openib BTL results in a segmentation fault.

So, maybe this simple fix would do the trick:
========================================================================
diff -r 96e0793d7885 ompi/mca/bml/r2/bml_r2.c
--- a/ompi/mca/bml/r2/bml_r2.c  Wed May 19 14:35:27 2010 +0200
+++ b/ompi/mca/bml/r2/bml_r2.c  Tue May 25 10:54:19 2010 +0200
@@ -210,7 +210,7 @@
             /* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
              * can take care of this task.
              */
-            continue;
+            return rc;
         }

         /* for each proc that is reachable */
========================================================================
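
To make the difference between the two behaviours concrete, here is a minimal toy sketch in plain C. It is not Open MPI code: every name in it (toy_btl_t, toy_bml_add_procs, the abort_on_error flag, ...) is hypothetical, and it does not reproduce the openib segfault. It only shows that with "continue" the caller reports success while a BTL is left half-initialized, whereas with "return rc" the error reaches the caller.

```c
#include <stddef.h>

/* Toy model only: none of these names are Open MPI APIs. */
#define TOY_SUCCESS 0
#define TOY_ERROR   (-1)

typedef struct toy_btl {
    int (*add_procs)(struct toy_btl *btl, unsigned *reachable);
    int half_initialized;   /* set when setup failed part-way through */
} toy_btl_t;

/* Mimics the problematic pattern: the "reachable" bit is set,
 * but an error is returned anyway. */
static int toy_failing_add_procs(toy_btl_t *btl, unsigned *reachable)
{
    *reachable |= 1u;           /* claims proc 0 is reachable ... */
    btl->half_initialized = 1;  /* ... but setup did not complete */
    return TOY_ERROR;
}

/* A well-behaved BTL: sets the bit and reports success. */
static int toy_working_add_procs(toy_btl_t *btl, unsigned *reachable)
{
    (void)btl;
    *reachable |= 1u;
    return TOY_SUCCESS;
}

/* abort_on_error == 0: ignore the error and try the next BTL ("continue").
 * abort_on_error != 0: propagate the error to the caller ("return rc"). */
static int toy_bml_add_procs(toy_btl_t *btls, size_t nbtls, int abort_on_error)
{
    for (size_t i = 0; i < nbtls; i++) {
        unsigned reachable = 0;
        int rc = btls[i].add_procs(&btls[i], &reachable);
        if (rc != TOY_SUCCESS) {
            if (abort_on_error) {
                return rc;      /* caller can abort cleanly */
            }
            continue;           /* error is lost here */
        }
        /* A real BML would now walk the reachable bits. */
    }
    return TOY_SUCCESS;
}
```

With one failing and one working toy BTL, toy_bml_add_procs(btls, 2, 0) returns TOY_SUCCESS even though the first BTL is flagged half_initialized, while toy_bml_add_procs(btls, 2, 1) returns TOY_ERROR.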

Does anyone see a case (with a specific BTL) where add_procs returns an error but we still want to continue?

Sylvain
