On May 21, 2008, at 4:29 PM, Brian W. Barrett wrote:

Previously, there has not been such a distinction, so I really have no
idea which caused the openib BTL to throw its error (and never really cared,
as it was always somebody else's problem at that point).

In the scenarios that I'm talking about, the ibv_devinfo(1) and ibv_devices(1) commands should report that there are no devices (you have OFED or an equivalent installed but have no verbs-capable hardware):

-----
[15:21] queeg:~/mpi % ibv_devinfo
No IB devices found
[16:41] queeg:~/mpi % ibv_devices
    device                 node GUID
    ------              ----------------
[16:41] queeg:~/mpi %
-----

Since there's no need for an immediate change to the code base, perhaps you could watch over the next few weeks and, when you see problems of the kind that you're worried about, run ibv_devices and ibv_devinfo. If you see OMPI-reported OpenFabrics problems with no warnings from libibverbs itself (like I mentioned in my first mail) and ibv_dev* are reporting no devices, then we need to worry about cases where the verbs stack itself doesn't even see the devices (which is a Really Big Error; the OS/driver stack doesn't even see the device).

If ibv_dev* reports that there *are* devices when you see the errors that you're worried about, then OMPI would have gotten past this first case and reported something a bit more specific. That is therefore a different warning than the one I'm proposing to remove [by default].
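
To make the check described above concrete, here is a rough Python sketch of the triage. It assumes the ibv_devices output follows the layout shown earlier in this mail (a header line, a dashed rule, then one row per device); the function names, the result strings, and the "mthca0" row in the usage note are all hypothetical and are not part of OMPI or libibverbs.

```python
def count_verbs_devices(ibv_devices_output: str) -> int:
    """Count device rows in text captured from the ibv_devices command.

    Assumption: the output looks like the sample in this mail, i.e. a
    "device  node GUID" header, a dashed rule, then one row per device.
    """
    count = 0
    for line in ibv_devices_output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and the dashed separator row.
        if not fields or fields[0] == "device" or set(fields[0]) == {"-"}:
            continue
        count += 1
    return count


def triage(ompi_reported_openfabrics_error: bool, ibv_devices_output: str) -> str:
    """Classify a failure along the lines discussed in this thread."""
    if not ompi_reported_openfabrics_error:
        return "no-error"
    if count_verbs_devices(ibv_devices_output) == 0:
        # The verbs stack itself sees nothing: the "Really Big Error" case.
        return "verbs-stack-sees-no-devices"
    # Devices are visible, so OMPI got past the first check and the
    # message it printed is a different, more specific warning.
    return "device-visible-different-warning"
```

Feeding it the empty listing from queeg above gives a device count of zero; appending a made-up row such as "mthca0  0002c90200001234" gives one. In practice the text would come from running ibv_devices on the node in question.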

I'm only concerned about the case where there's an IB card, the user
expects the IB card to be used, and the IB card isn't used.

Can you put in a site wide

btl = ^tcp

to avoid the problem? If the IB card fails, then you'll get unreachable MPI errors.

If the changes don't silence a warning in that situation, I'm fine with
whatever you do. But does ibv_get_device_list return an HCA when the port
is down (because the SM failed and the machine rebooted since that time)?

Yes.

If not, we still have a (fairly common, unfortunately) error case that we
need to report (in my opinion).


Agreed. This scenario is already covered by the checking that the openib BTL performs, and I agree that we should not remove this warning.

That being said, note that the current error-checking code in the openib BTL only reports if *no* active ports are found on the host. If there are multiple ports in a host where some are active and some are [erroneously] not active, OMPI does not report this (because some real-world users have dual-port HCAs but are only using 1 port).

Two options jump to mind:

1. Add yet another MCA param to say "all my ports should be active; warn/error if you find any non-active ports."

2. Add yet another MCA param where ports that *should* be active are itemized. If OMPI finds that any of them are not active, warn/error.

#1 could really be a special case of #2 (e.g., a keyword "all"). Both of these options wouldn't be too difficult to do, but we technically are feature frozen...
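
To illustrate how #1 folds into #2, here is a minimal Python sketch of the two checks side by side. Nothing here is OMPI code: the function names, the "device:port" naming, and the "ACTIVE"/"DOWN" state strings are assumptions made up for the example (the real openib BTL is C and queries port state through verbs).

```python
def current_check_warns(port_states):
    """Current behavior as described above: warn only if *no* port is active.

    port_states maps a hypothetical "device:port" name to a state string.
    """
    return not any(state == "ACTIVE" for state in port_states.values())


def inactive_expected_ports(expected, port_states):
    """Proposed check: return the ports that should be active but are not.

    expected is either the keyword "all" (option #1) or an itemized
    list of port names (option #2).
    """
    if expected == "all":
        wanted = list(port_states)
    else:
        wanted = list(expected)
    # A listed port that is missing entirely is reported as well,
    # because .get() returns None for it.
    return [port for port in wanted if port_states.get(port) != "ACTIVE"]
```

With a dual-port HCA where only port 1 is up, say {"mthca0:1": "ACTIVE", "mthca0:2": "DOWN"}, the current check stays silent, "all" flags mthca0:2, and an itemized list of just mthca0:1 flags nothing, which is the behavior the single-port users would want.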

--
Jeff Squyres
Cisco Systems
