Jeff Squyres wrote:
On May 22, 2008, at 6:50 AM, Terry Dontje wrote:
Brian and I chatted a bit about this off-list, and I think we're in
agreement now:
- do not change the default value or meaning of
btl_base_want_component_unused.
- major point of confusion: the openib BTL is actually fairly unique
in that it can (and does) tell the difference between "there are no
devices present" and "there are devices, but something went wrong".
Other BTLs have network interfaces that can't tell the difference and
can *only* call the no_nics function, regardless of whether there are
no relevant network interfaces or some error occurred during
initialization.
- so a reasonable solution would be an openib-BTL-specific mechanism
that doesn't call the no_nics function (which displays the
btl_base_want_component_unused warning) if there are no verbs-capable
devices found, because mainline Linux distributions are starting to
ship libibverbs. Specific mechanism TBD; likely to be an openib MCA
param.
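
The distinction in the list above can be sketched in C roughly as
follows. This is only an illustration of the decision being discussed,
not Open MPI code: probe_result_t, should_warn, and the warn_no_device
flag are hypothetical names, and warn_no_device merely stands in for the
openib MCA param whose actual mechanism is still TBD.

```c
#include <stdbool.h>

/* Hypothetical probe outcome; not an Open MPI type. */
typedef enum {
    PROBE_OK,           /* devices found and initialized */
    PROBE_NO_DEVICES,   /* library present, but no hardware found */
    PROBE_INIT_FAILED   /* hardware found, but initialization failed */
} probe_result_t;

/*
 * Decide whether the "component unused" warning should be shown.
 *
 * warn_component_unused: stands in for the existing global setting.
 * warn_no_device: hypothetical openib-specific switch (assumption);
 *                 when false, "no devices present" stays silent.
 */
bool should_warn(probe_result_t result,
                 bool warn_component_unused,
                 bool warn_no_device)
{
    if (!warn_component_unused) {
        return false;               /* warning globally disabled */
    }
    switch (result) {
    case PROBE_OK:
        return false;               /* component is usable, nothing to report */
    case PROBE_INIT_FAILED:
        return true;                /* devices exist but something went wrong */
    case PROBE_NO_DEVICES:
        return warn_no_device;      /* only warn if explicitly requested */
    }
    return false;
}
```

A BTL that cannot tell the two failure cases apart would effectively
always report PROBE_NO_DEVICES with warn_no_device set to true, which is
the current no_nics behavior described above.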
So, if you are delivering something similar to a BTL for myrinet you
will see the message, but the belief is that this is necessary since
there isn't enough granularity in the error reporting of the device to
feel comfortable about whether the user wants the device to be used?
The major difference here is that libmyriexpress is not being included
in mainline Linux distributions. Specifically: if you can find/use
libmyriexpress, it's likely because you have that hardware. The same
*used* to be true for libibverbs, but is no longer true because Linux
distros are now shipping it (e.g., the Debian distribution pulls in
libibverbs when you install Open MPI).
Ok, but there are distributions that do include the myrinet BTL/MTL
(i.e., CT). Though I agree for the most part that in the case of
myrinet, if you have libmyriexpress you will probably have an operable
interface. I guess I am curious how many other BTLs a distribution
might end up delivering that could run into this reporting issue. I
guess my point is: could this be worth something more general instead
of a one-off for IB?
From my point of view, btl_warn_unused_components coupled with "-mca
btl ^mlfbtl" works for me. However, the fact that the IB
vendors/community (i.e., Cisco) are solving this for their favorite
interface gives me pause for a moment.
Won't udapl have a similar issue here or does it not get built by
default when OFED is built?
We decided that under Linux, the udapl BTL does not get built by
default (even if it could) because then an "mpirun a.out" by default
would use both UDAPL and verbs, which is undesirable for several
reasons. There's Linux-specific logic to this effect in config/
ompi_check_udapl.m4.
Ok, that makes sense.
FWIW, our distribution actually turns off
btl_base_want_component_unused because it seemed that in the majority
of our cases users would see false-positive sightings of the message.
Is the UDAPL library shipped in Solaris by default? If so, then
you're likely in exactly the same kind of situation that I'm
describing. The same will be true if Solaris ends up shipping
libibverbs by default.
Yes, the UDAPL library is shipped in Solaris by default, which is why
we turn off btl_warn_unused_components. And yes, I suspect that once
Solaris starts delivering libibverbs, we (Sun) will need to figure out
how to handle having both the udapl and openib BTLs available.
--td