On Mar 4, 2015, at 3:25 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

> On Wed, Mar 4, 2015 at 1:04 PM, Dave Goodell (dgoodell) <dgood...@cisco.com> 
> wrote:
> [...]
> > libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
> > libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
> 
> I think that warning is printed by libibverbs itself.  Are you 100% sure 
> there are no IB HCAs sitting in the head node?  If there are IB HCAs but you 
> don't want them to be used, you might want to ensure that the various verbs 
> kernel modules don't get loaded, which is one half of the mismatch which 
> confuses libibverbs.
> [...]
>  
> FWIW, I can confirm that these two lines are from libibverbs itself:
> $ strings /usr/lib64/libibverbs.a | grep -e 'no userspace' -e 'open config directory'
> libibverbs: Warning: no userspace device-specific driver found for %s
> libibverbs: Warning: couldn't open config directory '%s'.

Yes, I think you'd also see the same message if you run "ibv_devices" or 
"ibv_devinfo" on the head node.

> As it happens, the login node *does* have an HCA installed and the kernel 
> modules appear to be loaded.  However, as the "17th node" in the cluster it 
> was never cabled to the 16-port switch and the package(s) that should have 
> created/populated /etc/libibverbs.d are *not* present (specifically the login 
> node has libipathverbs-devel installed but not libipathverbs).
> 
> So, Dave, are you saying that what I describe in the previous paragraph would 
> be considered "misconfiguration"?  I am fine with dropping the discussion of 
> those first two lines if there is agreement that Open MPI shouldn't be 
> responsible for handling this case.

I would consider that to be a lesser misconfiguration, which is only really an 
issue because of libibverbs deficiencies.  Either the hardware could be removed 
from the head node or the kernel modules could be unloaded / prevented from 
loading on the head node.
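
For anyone who wants to check a node for this kind of mismatch without 
triggering the warnings, here is a rough sketch.  It only looks at the two 
paths named in the warnings above (kernel side: uverbsN entries under 
/sys/class/infiniband_verbs; userspace side: the config directory).  The 
function name check_verbs_config is made up for illustration, and this does 
not reproduce libibverbs' actual driver-matching logic; it only flags the 
"kernel device present, no userspace driver config" situation.

```python
import os


def check_verbs_config(sysfs_dir="/sys/class/infiniband_verbs",
                       config_dir="/etc/libibverbs.d"):
    """Return a list of findings describing a possible verbs mismatch.

    Hypothetical helper: the default paths are taken from the warnings
    quoted in this thread, not from the libibverbs source.
    """
    findings = []

    # Kernel side: uverbsN entries exist when the verbs modules are loaded
    # and have bound to a device.  Other entries are ignored.
    try:
        devices = sorted(e for e in os.listdir(sysfs_dir)
                         if e.startswith("uverbs"))
    except OSError:
        devices = []

    # Userspace side: the config directory is normally populated by the
    # device-specific driver packages.
    try:
        drivers = sorted(os.listdir(config_dir))
        config_present = True
    except OSError:
        drivers = []
        config_present = False

    if devices and not config_present:
        findings.append(
            "kernel devices %s present but %s is missing"
            % (", ".join(devices), config_dir))
    elif devices and not drivers:
        findings.append(
            "kernel devices %s present but %s is empty"
            % (", ".join(devices), config_dir))

    return findings


if __name__ == "__main__":
    for line in check_verbs_config():
        print(line)
```

An empty result means either no uverbs devices were found or the config 
directory has at least one entry; it does not prove that the installed driver 
actually matches the hardware.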

> Now the ibv_fork_init() warnings are another issue entirely.  Since btl:verbs 
> and mtl:psm both work (at least separately) perfectly fine on the compute 
> nodes, I don't believe that there are any configuration issues there.

Agreed, something needs to be improved there.  I assume that Mike D. or someone 
from his team will take a look.  I don't have any bandwidth to look at this 
myself.

-Dave
