The BTLs are allowed to fail adding procs without major consequences in the 
short term. As you noticed, each BTL returns a bit mask array containing all 
procs reachable through this particular instance of the BTL. Later (in the same 
file, line 395) we check for complete coverage of all procs, and only 
complain if one of the peers is unreachable.

If you replace the continue statement with a return, we will never give the 
other BTLs a chance, and we will complain about lack of connectivity as soon 
as one BTL fails (for whatever reason). On top of that, none of the eager, 
send and RDMA endpoint arrays will be built.
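
To make the difference concrete, here is a minimal, self-contained sketch of 
the two behaviours. It is not the actual bml/r2 code: the names 
add_procs_fn, btl_fails, btl_covers_all and add_procs_all are hypothetical, 
and a plain bool array stands in for the real reachability bit mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPROCS 4

/* Hypothetical stand-in for a BTL's add_procs: marks the procs it can reach
 * in `reachable` and returns 0 on success, non-zero on failure. */
typedef int (*add_procs_fn)(bool reachable[NPROCS]);

/* A BTL that fails outright (e.g. no usable device on this node). */
int btl_fails(bool reachable[NPROCS])
{
    (void)reachable;
    return -1;
}

/* A BTL that can reach every proc (e.g. a TCP-like fallback). */
int btl_covers_all(bool reachable[NPROCS])
{
    for (size_t p = 0; p < NPROCS; p++) {
        reachable[p] = true;
    }
    return 0;
}

/* Ask each BTL in turn, accumulate coverage, and only fail at the end if
 * some proc is reachable by no BTL at all. With stop_on_error set, the
 * first failing BTL aborts the loop, as in the proposed patch. */
int add_procs_all(const add_procs_fn *btls, size_t nbtls, bool stop_on_error)
{
    bool covered[NPROCS] = { false };

    assert(btls != NULL);

    for (size_t i = 0; i < nbtls; i++) {
        bool reachable[NPROCS] = { false };
        int rc = btls[i](reachable);

        if (rc != 0) {
            if (stop_on_error) {
                return rc;   /* proposed change: later BTLs never run */
            }
            continue;        /* current behaviour: try the next BTL */
        }
        for (size_t p = 0; p < NPROCS; p++) {
            if (reachable[p]) {
                covered[p] = true;
            }
        }
    }

    /* Deferred check: complain only if a peer is unreachable by every BTL. */
    for (size_t p = 0; p < NPROCS; p++) {
        if (!covered[p]) {
            return -1;
        }
    }
    return 0;
}
```

With the list { btl_fails, btl_covers_all }, add_procs_all(..., false) 
succeeds because the second BTL covers every proc, while 
add_procs_all(..., true) returns the first BTL's error before the second one 
is ever asked.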

  george.

On May 25, 2010, at 05:10 , Sylvain Jeaugey wrote:

> Hi,
> 
> I'm currently trying to have Open MPI exit more gracefully when a BTL returns 
> an error during the "add procs" phase.
> 
> The current bml/r2 code silently ignores btl->add_procs() error codes with 
> the following comment :
> ---- ompi/mca/bml/r2/bml_r2.c:208 ----
>  /* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
>   * can take care of this task. */
>  continue;
> --------------------------------------
> 
> This seems wrong to me : either a proc is reached (the "reachable" bit field 
> is therefore updated), or it is not (and nothing is done). Any error code 
> should denote a fatal error needing a clean abort.
> 
> In the current openib btl code, the "reachable" bit is set but an error is 
> returned - then ignored by r2. The next call to the openib BTL results in a 
> segmentation fault.
> 
> So, maybe this simple fix would do the trick :
> ========================================================================
> diff -r 96e0793d7885 ompi/mca/bml/r2/bml_r2.c
> --- a/ompi/mca/bml/r2/bml_r2.c  Wed May 19 14:35:27 2010 +0200
> +++ b/ompi/mca/bml/r2/bml_r2.c  Tue May 25 10:54:19 2010 +0200
> @@ -210,7 +210,7 @@
>             /* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
>              * can take care of this task.
>              */
> -            continue;
> +            return rc;
>         }
> 
>         /* for each proc that is reachable */
> ========================================================================
> 
> Does anyone see a case (with a specific btl) where add_procs returns an error 
> but we still want to continue ?
> 
> Sylvain
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

