On Thu, 22 May 2008, Terry Dontje wrote:

The major difference here is that libmyriexpress is not being included
in mainline Linux distributions.  Specifically: if you can find/use
libmyriexpress, it's likely because you have that hardware.  The same
*used* to be true for libibverbs, but is no longer true because Linux
distros are now shipping it (e.g., the Debian distribution pulls in
libibverbs when you install Open MPI).

Ok, but there are distributions that do include the Myrinet BTL/MTL (i.e., CT). Though I agree that, for the most part, in the case of Myrinet, if you have libmyriexpress you will probably have an operable interface. I guess I am curious how many other BTLs a distribution might end up delivering that could run into this reporting issue. I guess my point is: could this be worth something more general instead of a one-off for IB?

From my point of view, btl_warn_unused_components coupled with "-mca btl ^mlfbtl" works for me. However, the fact that the IB vendors/community (i.e., Cisco) is solving this for their favorite interface gives me pause for a moment.

There's actually a second (in my mind more important) reason why this is IB only, as I shared similar concerns (hence yesterday's e-mail barrage). InfiniBand has a two-stage initialization -- you get the list of HCAs, then you initialize the HCA you want. So it's possible to distinguish the case where there are no HCAs in the system from the case where the system couldn't initialize the HCA properly (as that would happen in step 2, according to Jeff).
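To make the two-stage distinction concrete, here is a minimal sketch in C. The function-pointer types and stub functions are hypothetical stand-ins for the two stages, not a real library API; in libibverbs the stages would roughly correspond to ibv_get_device_list() and ibv_open_device(), but stubs are used so the sketch runs without any hardware or library installed.

```c
#include <assert.h>
#include <stddef.h>

/* Outcomes a BTL could report after probing for hardware. */
typedef enum {
    PROBE_NO_DEVICES,   /* stage 1 found nothing */
    PROBE_INIT_FAILED,  /* stage 1 found devices, stage 2 opened none of them */
    PROBE_OK            /* at least one device opened */
} probe_result_t;

/* Hypothetical stage signatures (assumptions made for this sketch). */
typedef int (*list_devices_fn)(void);      /* returns the number of devices */
typedef int (*open_device_fn)(int index);  /* returns 0 on success */

probe_result_t probe_two_stage(list_devices_fn list_devices,
                               open_device_fn open_device)
{
    assert(list_devices != NULL && open_device != NULL);

    /* Stage 1: enumerate devices. */
    int count = list_devices();
    if (count <= 0) {
        return PROBE_NO_DEVICES;
    }

    /* Stage 2: try to open each device that stage 1 reported. */
    for (int i = 0; i < count; i++) {
        if (open_device(i) == 0) {
            return PROBE_OK;
        }
    }
    return PROBE_INIT_FAILED;
}

/* Stubs that simulate the different situations. */
int stub_list_none(void) { return 0; }
int stub_list_one(void)  { return 1; }
int stub_open_ok(int index)   { (void)index; return 0; }
int stub_open_fail(int index) { (void)index; return -1; }
```

With these stubs, probe_two_stage(stub_list_none, stub_open_ok) yields PROBE_NO_DEVICES, while probe_two_stage(stub_list_one, stub_open_fail) yields PROBE_INIT_FAILED. A single init call that only reports success or failure cannot separate those two outcomes.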

With MX, it's one initialization call (mx_init), and it's not clear from the errors it can return that you can differentiate between the two cases. I haven't tried it, but it's possible that mx_init would succeed in the no-NIC case, but then have a NIC count of 0.

Anyway, the short answer is that (in my opinion) we should have a btl base param similar to warn_unused for whether to warn when no NICs/HCAs are found, hopefully with a nice error function similar to today's no_nics (which probably needs to be renamed in that case). That way, if BTL authors other than OpenIB want to do some extra work and return better error messages, they can.
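A rough sketch of what such a parameter-gated report could look like, in C. Everything here is hypothetical: the struct, the field name warn_no_devices, the function name, and the message text are invented for illustration and are not existing Open MPI symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical base-level flag, modeled on the existing warn-unused param. */
typedef struct {
    bool warn_no_devices;
} btl_base_params_t;

/*
 * Shared helper any BTL could call when it finds no NICs/HCAs.
 * Writes a message into buf and returns true only if the flag is set;
 * otherwise leaves buf untouched and returns false.
 */
bool report_no_devices(const btl_base_params_t *params, const char *component,
                       char *buf, size_t len)
{
    assert(params != NULL && component != NULL && buf != NULL && len > 0);

    if (!params->warn_no_devices) {
        return false;
    }
    snprintf(buf, len, "%s: no NICs/HCAs found; component will not be used",
             component);
    return true;
}
```

The point of putting the check in one shared helper is that a distribution could silence the message with one setting, while a BTL that can tell "no hardware" from "hardware failed to initialize" would call the helper only in the first case.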

FWIW, our distribution actually turns off
btl_base_want_component_unused because it seemed that, in the majority
of our cases, users would be seeing false positives of the message.

Is the UDAPL library shipped in Solaris by default?  If so, then
you're likely in exactly the same kind of situation that I'm
describing.  The same will be true if Solaris ends up shipping
libibverbs by default.

Yes the UDAPL library is shipped in Solaris by default.  Which is why we
turn off btl_warn_unused_components.  Yes, and I suspect once Solaris
starts delivering libibverbs we (Sun) will need to figure out how to
handle having both the udapl and openib btls being available.

There is some evil configure hackery that could be done to make this work in a more general way (don't you love it when I say that). Autogen/configure makes no guarantees about the order in which the configure.m4 macros for components in the same framework are run, other than all components of priority X are run before those of priority Y, iff X > Y. So you could set the priority of all the components except udapl to (say) 10 and udapl's to 0. Then have the udapl configure only build if 1) it was specifically requested or 2) ompi_check_openib_happy = no. No more Linux-specific stuff, works when Solaris gets OFED, and works on old Solaris that has uDAPL but not OFED.

As a matter of fact, it's so trivial to do that I'd recommend doing it for 1.3. Really, you could do it minimally by only changing OpenIB's configure.params to set its priority to 10, uDAPL's configure.params to set its priority to 0, and uDAPL's configure.m4 to remove the Linux stuff and look for ompi_check_openib_happy.
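The decision logic behind that suggestion can be modeled in a few lines of C. This is only a model of the ordering and the build condition, not actual configure.m4 or configure.params content; the type and function names are made up for the sketch, and the two boolean inputs stand for "uDAPL was specifically requested" and "ompi_check_openib_happy = yes".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* A component and its configure priority (hypothetical model type). */
typedef struct {
    const char *name;
    int priority;
} component_t;

/* Higher priority sorts first. */
static int by_priority_desc(const void *a, const void *b)
{
    const component_t *x = a;
    const component_t *y = b;
    return (y->priority > x->priority) - (y->priority < x->priority);
}

/*
 * Order components the way the configure macros would run.
 * qsort is not stable, which mirrors the lack of any ordering
 * guarantee among components of equal priority.
 */
void order_components(component_t *c, size_t n)
{
    assert(c != NULL || n == 0);
    qsort(c, n, sizeof *c, by_priority_desc);
}

/* uDAPL builds if it was explicitly requested or if OpenIB is not usable. */
bool should_build_udapl(bool udapl_requested, bool openib_happy)
{
    return udapl_requested || !openib_happy;
}
```

Because openib (priority 10) always sorts ahead of udapl (priority 0), the openib result is known by the time the udapl check runs, which is what lets the udapl decision depend on it.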


Brian
