On May 21, 2008, at 11:14 AM, Brian W. Barrett wrote:

I think having a parameter to turn off the warning is a great idea. So great, in fact, that it already exists in the trunk and v1.2 :)! Changing the default value of the btl_base_warn_component_unused flag from 1 to 0
will have the desired effect.

Ah, ok.  I either didn't know about this flag or forgot about it.  :-)

I just tested this myself and see that there are actually *two* error messages (on a machine where I installed libibverbs, but with no OpenFabrics hardware, with OMPI 1.2.6):

% mpirun -np 1 hello
libibverbs: Fatal: couldn't read uverbs ABI version.
--------------------------------------------------------------------------
[0,1,0]: OpenIB on host eddie.osl.iu.edu was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------

So the MCA param takes care of the OMPI message; I'll contact the libibverbs authors about their message.

I'm not sure I agree with setting the default to 0, however. The warning has proven extremely useful for diagnosing that IB (or less often GM or MX) isn't properly configured on a compute node due to some random error.
It's trivially easy for any packaging group to have the line

  btl_base_warn_component_unused = 0

added to $prefix/etc/openmpi-mca-params.conf during the install phase of
the package build (indeed, our simple build scripts at LANL used to do
this on a regular basis due to our need to tweak the OOB to keep IPoIB
happier at scale).
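
As a rough illustration of that install-phase step, a packaging script could do something like the sketch below. The PREFIX fallback to a temporary directory and the variable names are assumptions made only for the example; a real package build would substitute its own install prefix.

```shell
# Hypothetical install prefix; falls back to a scratch directory so the
# sketch can be run without touching a real Open MPI installation.
PREFIX="${PREFIX:-$(mktemp -d)}"
CONF="$PREFIX/etc/openmpi-mca-params.conf"
PARAM="btl_base_warn_component_unused"

mkdir -p "$PREFIX/etc"
touch "$CONF"

# Append the setting only if the parameter is not already set in the file,
# so repeated package builds do not add duplicate lines.
if ! grep -q "^[[:space:]]*$PARAM[[:space:]]*=" "$CONF"; then
    printf '%s = 0\n' "$PARAM" >> "$CONF"
fi
```

For a single run rather than a site-wide default, the same MCA parameter can be given on the command line (mpirun --mca btl_base_warn_component_unused 0 -np 1 hello) or through the environment (OMPI_MCA_btl_base_warn_component_unused=0).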

I think keeping the Debian guys happy is a good thing.  Giving them an
easy way to turn off silly warnings is a good thing.  Removing a known
useful warning to help them doesn't seem like a good thing.

I guess that this is what I am torn about. Yes, it's a useful message -- in some cases. But now that libibverbs is shipping in Debian and other Linuxes, the number of machines out there with verbs-capable hardware is far, far smaller than the number of machines without verbs-capable hardware. Specifically:

1. The number of cases where seeing the message by default is *not* useful is now potentially [much] larger than the number of cases where the default message is useful.

2. An out-of-the-box "mpirun a.out" will print warning messages in perfectly valid/good configurations (no verbs-capable hardware, but just happen to have libibverbs installed). This is a Big Deal.

3. Problems with HCA hardware and/or verbs stack are uncommon (nowadays). I'd be ok asking someone to enable a debug flag to get more information on configuration problems or hardware faults.

Shouldn't we be optimizing for the common case?

In short: I think it's no longer safe to assume that machines with libibverbs installed must also have verbs-capable hardware.

--
Jeff Squyres
Cisco Systems
