On Thu, 27 May 2010, Jeff Squyres wrote:

On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote:

That's pretty much my first proposal: abort when an error arises,
because if we don't, we'll crash soon afterwards. That's my original
concern and this should really be fixed.

Now, if you want to fix the openib BTL so that an error in IB results in
an elegant fallback on TCP (elegant = notified ;-)), then hooray.

You're specifically referring to the point where the openib BTL sets the reachable bit, but then later decides "oops, an error occurred, so return !=OMPI_SUCCESS" -- and assumes that it will not be called again.

Right?
Perfectly right.

If so, then yes, that's a bug. The openib btl should be fixed to unset the reachable bit(s) that it just set before returning the error.

Or the BML could assume that a !=OMPI_SUCCESS code means that the reachable bits it got back were invalid and should be ignored.

I'd lean towards the former.
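
To make the two options above concrete, here is a minimal sketch in C. It is not the openib BTL or BML code: every name in it (the `sketch_` functions, the 64-bit bitmap, the return codes, the injected failure) is a hypothetical stand-in chosen for illustration. It assumes the fix amounts to remembering which reachable bits a call set itself and clearing only those before returning an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT Open MPI's types or return codes. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };
#define SKETCH_MAX_PROCS 64

typedef uint64_t sketch_bitmap_t;  /* one reachable bit per peer */

static void sketch_set_bit(sketch_bitmap_t *bm, size_t i)
{
    *bm |= (UINT64_C(1) << i);
}

static void sketch_clear_bit(sketch_bitmap_t *bm, size_t i)
{
    *bm &= ~(UINT64_C(1) << i);
}

static int sketch_is_set(const sketch_bitmap_t *bm, size_t i)
{
    return (int)((*bm >> i) & 1u);
}

/* Hypothetical per-peer setup that fails for the peer index fail_at. */
static int sketch_setup_peer(size_t peer, size_t fail_at)
{
    return (peer == fail_at) ? SKETCH_ERROR : SKETCH_SUCCESS;
}

/* Option 1: the BTL side. Mark peers reachable as they are set up; on
 * error, clear only the bits this call set, then return the error. */
static int sketch_add_procs(size_t nprocs, size_t fail_at,
                            sketch_bitmap_t *reachable)
{
    sketch_bitmap_t set_here = 0;  /* bits set by this call only */

    assert(nprocs <= SKETCH_MAX_PROCS);
    for (size_t i = 0; i < nprocs; ++i) {
        if (sketch_setup_peer(i, fail_at) != SKETCH_SUCCESS) {
            for (size_t j = 0; j < i; ++j) {
                if (sketch_is_set(&set_here, j)) {
                    sketch_clear_bit(reachable, j);
                }
            }
            return SKETCH_ERROR;
        }
        if (!sketch_is_set(reachable, i)) {
            sketch_set_bit(reachable, i);
            sketch_set_bit(&set_here, i);
        }
    }
    return SKETCH_SUCCESS;
}

/* Option 2: the caller side. Treat the reported bits as invalid
 * whenever the return code is not success. */
static sketch_bitmap_t sketch_bml_accept(int rc, sketch_bitmap_t reported)
{
    return (rc == SKETCH_SUCCESS) ? reported : 0;
}
```

With the first option the component that set the bits is also the one that cleans them up, so a caller that forgets to check the return code still sees a consistent bitmap. The second option only helps if every caller applies the check.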

Can you file a bug and submit a patch?
I'd like to (though I don't have an svn account), but some things
bother me.

Having errors on add_procs stop the application seems like a good thing in all cases, so why not do it? That would solve my original problem, which led to this discussion.

Yes, the openib BTL may be suboptimal (stopping the application instead of nicely returning), but I'm fine with that, so I'm not very inclined to spend time on this.

Sylvain
