On Wed, 21 May 2008, Jeff Squyres wrote:

I'm only concerned about the case where there's an IB card, the user
expects the IB card to be used, and the IB card isn't used.

Can you put in a site wide

btl = ^tcp

to avoid the problem?  If the IB card fails, then you'll get
unreachable MPI errors.

And how many users are going to figure that one out before complaining loudly? That's what LANL did (probably still does) and it worked great there, but that doesn't mean that others will figure that out (after all, not everyone has an OMPI developer on staff...).

If the changes don't silence a warning in that situation, I'm fine
with whatever you do.  But does ibv_get_device_list return an HCA
when the port is down (because the SM failed and the machine rebooted
since that time)?

Yes.

If this is true (for some reason I thought it wasn't), then I think we'd actually be ok with your proposal, but you're right, you'd need something new in the IB btl. I'm not concerned about the dual rail issue -- if you're smart enough to configure dual rail IB, you're smart enough to figure out OMPI mca params. I'm not sure the same is true for a simple IB setup delivered by a white box vendor that barely works on a good day (and unfortunately, there seems to be evidence that these exist).


Brian
