On May 21, 2008, at 4:29 PM, Brian W. Barrett wrote:
Previously, there has not been such a distinction, so I really have no
idea which caused the openib BTL to throw its error (and never really
cared, as it was always somebody else's problem at that point).
In the scenarios that I'm talking about, ibv_devinfo(1) and
ibv_devices(1) commands should return that there are no devices (you
have OFED or equivalent installed but have no verbs-capable hardware):
-----
[15:21] queeg:~/mpi % ibv_devinfo
No IB devices found
[16:41] queeg:~/mpi % ibv_devices
device node GUID
------ ----------------
[16:41] queeg:~/mpi %
-----
Since there's no need for an immediate change to the code base --
perhaps you could watch over the next few weeks and when you see
problems of the kind that you're worried about, run ibv_devices and
ibv_devinfo. If you see OMPI-reported openfabrics problems with no
warnings from libibverbs itself (like I mentioned in my first mail)
and ibv_dev* are reporting no devices, then we need to worry about
cases where the verbs stack itself doesn't even see the devices (which
is a Really Big Error; the OS/driver stack doesn't even see the device).
If ibv_dev* reports that there *are* devices when you see the errors
that you're worried about, then OMPI would have gotten past this first
case and reported something a bit more specific. That is therefore a
different warning than the one I'm proposing to remove [by default].
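
To make the two cases concrete, here is a rough sketch (not anything that
exists in OMPI) of how one might classify the stdout of ibv_devices(1)
when triaging such a report. It assumes the output is a header line, a
dashed rule, and then one row per device; the device name and GUID in
the sample are made up, and the function names are hypothetical.

```python
def count_verbs_devices(ibv_devices_output: str) -> int:
    """Count device rows in text captured from ibv_devices(1) stdout."""
    count = 0
    for line in ibv_devices_output.splitlines():
        fields = line.split()
        # Skip blank lines, the "device / node GUID" header, and the dashed rule.
        if len(fields) < 2 or fields[0] == "device" or set(fields[0]) == {"-"}:
            continue
        count += 1
    return count


def classify(ibv_devices_output: str) -> str:
    """Map the device count onto the two cases discussed above."""
    if count_verbs_devices(ibv_devices_output) == 0:
        # The verbs stack sees no devices at all.
        return "no-devices"
    # Devices are visible, so any error comes from a later, more specific check.
    return "devices-present"


# Output shaped like the transcript above (no verbs-capable hardware).
NO_DEVICES = "    device          \t   node GUID\n    ------          \t----------------\n"
# Hypothetical row; the device name and GUID are invented for illustration.
ONE_DEVICE = NO_DEVICES + "    mthca0          \t0005ad00000c03ec\n"

if __name__ == "__main__":
    print(classify(NO_DEVICES))
    print(classify(ONE_DEVICE))
```

Feed it only the command's stdout; shell prompt lines pasted along with
the output would be miscounted as device rows.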
I'm only concerned about the case where there's an IB card, the user
expects the IB card to be used, and the IB card isn't used.
Can you put in a site-wide
btl = ^tcp
to avoid the problem? If the IB card fails, then you'll get
unreachable MPI errors.
If the changes don't silence a warning in that situation, I'm fine with
whatever you do. But does ibv_get_device_list return an HCA when the
port is down (because the SM failed and the machine rebooted since that
time)?
Yes.
If not, we still have a (fairly common, unfortunately) error case that
we need to report (in my opinion).
Agreed. This scenario is already covered by the checking that the
openib BTL performs, and I agree that we should not remove this warning.
That being said, note that the current error-checking code in the
openib BTL only reports if *no* active ports are found on the host.
If there are multiple ports in a host where some are active and some
are [erroneously] not active, OMPI does not report this (because some
real-world users have dual-port HCAs but are only using 1 port).
Two options jump to mind:
1. Add yet another MCA param to say "all my ports should be active;
warn/error if you find any non-active ports."
2. Add yet another MCA param where ports that *should* be active are
itemized. If OMPI finds that any of them are not active, warn/error.
#1 could really be a special case of #2 (e.g., a keyword "all"). Both
of these options wouldn't be too difficult to do, but we technically
are feature frozen...
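
For illustration only, the check behind those two options might look
something like the sketch below. Nothing here is an existing MCA
parameter or OMPI function: the parameter syntax ("all", or a
comma-separated list of device:port entries), the function names, and
the device names are all assumptions. The first function mirrors the
current behavior described above (report only when no port is active);
the rest treats #1 as the "all" special case of #2.

```python
from typing import Dict, List, Tuple

Port = Tuple[str, int]  # (device name, port number)


def no_active_ports(port_active: Dict[Port, bool]) -> bool:
    """Current behavior as described above: true only if *no* port is active."""
    return not any(port_active.values())


def expected_ports(param_value: str, discovered: List[Port]) -> List[Port]:
    """Expand the hypothetical parameter value into a list of ports."""
    value = param_value.strip()
    if value == "":
        return []  # Parameter unset: nothing is required to be active.
    if value.lower() == "all":
        return list(discovered)  # Option #1 as a special case of option #2.
    ports = []
    for item in value.split(","):
        device, sep, port = item.strip().rpartition(":")
        if not sep or not device:
            raise ValueError("expected device:port, got %r" % item)
        ports.append((device, int(port)))
    return ports


def inactive_expected_ports(param_value: str,
                            port_active: Dict[Port, bool]) -> List[Port]:
    """Return the expected ports that are not active.

    A listed port that was not discovered at all is treated as inactive.
    """
    wanted = expected_ports(param_value, list(port_active))
    return [p for p in wanted if not port_active.get(p, False)]


if __name__ == "__main__":
    # Dual-port HCA with only port 1 active (hypothetical device name).
    states = {("mthca0", 1): True, ("mthca0", 2): False}
    print(no_active_ports(states))
    print(inactive_expected_ports("all", states))
```

With the parameter unset, the dual-port, single-cable users see no
change; only sites that opt in get the extra warning/error.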
--
Jeff Squyres
Cisco Systems