On Wed, 21 May 2008, Jeff Squyres wrote:

On May 21, 2008, at 3:38 PM, Jeff Squyres wrote:

It would be great if libibverbs could return two different error
messages - one for "there's no IB card in this machine" and one for
"there's an IB card here, but we can't initialize it".  I think that
would make this argument go away.  Open MPI could probably mimic that
behavior by parsing the PCI tables, but that sounds ... painful.


Thinking about this a bit more -- I think it depends on what kind of
errors you are worried about seeing.  IBV does separate the discovery
of devices (ibv_get_device_list) from trying to open a device
(ibv_open_device).  So hypothetically, we *can* distinguish between
these kinds of errors already.

Do you see devices that are so broken that they don't show up in the
list returned from ibv_get_device_list?

FWIW: the *only* case I'm talking about changing the default for is
when ibv_get_device_list returns an empty list (meaning that according
to the verbs stack, there are no devices in the host).  I think that
we should *always* warn about any kind of error that occurs after that
(e.g., we find a device but can't open it, we find one or more devices
but no active ports, etc.).
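
A minimal sketch of the policy described above, in plain C with no
dependency on libibverbs.  The struct, enum, and function names are
hypothetical and are not taken from the openib BTL; the counts are
assumed to have been gathered beforehand from ibv_get_device_list,
ibv_open_device, and a per-port state query.  Only the "no devices
listed" case is configurable; every later failure always warns.

```c
/* Hypothetical summary of one probe of the verbs stack. */
struct probe_result {
    int devices_listed;  /* entries returned by ibv_get_device_list */
    int devices_opened;  /* of those, how many ibv_open_device succeeded on */
    int active_ports;    /* active ports found on the opened devices */
};

/* The distinct outcomes discussed in this thread. */
enum probe_outcome {
    PROBE_NO_DEVICES,       /* verbs stack reports no devices in the host */
    PROBE_OPEN_FAILED,      /* device(s) listed, but none could be opened */
    PROBE_NO_ACTIVE_PORTS,  /* device(s) opened, but no port is active */
    PROBE_OK
};

/* Map the raw counts onto one outcome, earliest failure first. */
enum probe_outcome classify_probe(const struct probe_result *r)
{
    if (r->devices_listed <= 0) {
        return PROBE_NO_DEVICES;
    }
    if (r->devices_opened <= 0) {
        return PROBE_OPEN_FAILED;
    }
    if (r->active_ports <= 0) {
        return PROBE_NO_ACTIVE_PORTS;
    }
    return PROBE_OK;
}

/*
 * Decide whether to print a warning.  warn_on_no_devices is the only
 * tunable (an assumed name); failures after discovery always warn.
 */
int should_warn(enum probe_outcome outcome, int warn_on_no_devices)
{
    switch (outcome) {
    case PROBE_NO_DEVICES:
        return warn_on_no_devices ? 1 : 0;
    case PROBE_OPEN_FAILED:
    case PROBE_NO_ACTIVE_PORTS:
        return 1;
    case PROBE_OK:
    default:
        return 0;
    }
}
```

With this split, a host where a device is listed but has no active
port still produces a warning even when warn_on_no_devices is 0.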

Previously, there has not been such a distinction, so I really have no idea which case caused the openib BTL to throw its error (and never really cared, as it was always somebody else's problem at that point).

I'm only concerned about the case where there's an IB card, the user expects the IB card to be used, and the IB card isn't used. If the changes don't silence a warning in that situation, I'm fine with whatever you do. But does ibv_get_device_list return an HCA when the port is down (because the SM failed and the machine has rebooted since then)? If not, we still have a (fairly common, unfortunately) error case that we need to report (in my opinion).


Brian
