Hi Fab,

On Wed, Sep 17, 2008 at 4:28 PM, Fab Tillier <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> We recently found that on several systems, with different OSes and
>> different HCAs, IPoIB is not able to establish a connection due to a
>> timeout. Once the HCA was disabled and re-enabled (in Device Manager)
>> the problem was gone. We have a very busy InfiniBand network: many
>> nodes connected and tests running 24x7, but this is nothing compared
>> to a client's network.
>> I think this situation requires better handling; a message in the
>> system log (see below) is not enough. Maybe something repetitive that
>> sends this query every few seconds for as long as the connection is
>> not established when it should be. Any thoughts?
>
> IPoIB allows 10 seconds (1 second timeouts, 10 retries) by default to hear
> back from the SM. Even if you get past this issue, you will likely run into
> the same timeouts when querying for paths to respond to ARP requests. While
> you may be able to do something internally to IPoIB or IBAL to exponentially
> back off for these queries, the OS will not give you more time to get a
> response from the SM, and the ARP resolution will time out.
>
> In my experience, this issue is related to the SA not being in sync with the
> topology recently discovered by the SM.

How did you determine this? Which SM? If OpenSM, which version? Is it
recent? At least in terms of OpenSM, I'm not sure what you mean by "in sync
with the recently discovered topology", as the SA and SM share the same data.

> What happens is that IPoIB will issue the port info query as soon as the IB
> port is up (SM moved port to active state), but the SA doesn't have a record
> for the port yet. The SM should update the SA's topology before bringing the
> ports active for things to work properly.

When you say the port is up, do you mean PhysicalPortState or PortState? At
least for OpenSM, once the port is discovered by the SM, it would be reported
in an SA Get or GetTable PortInfoRecord. There is a window between when the
PhysicalPortState is LinkUp and when the SM discovers it.

-- Hal

> The reason disable/enable solves the issue is that by the time IPoIB is
> enabled again, the SA's topology matches the SM's (there's more of a delay
> with IPoIB being reported and the SM simultaneously bringing the HCA's port
> up). You can get the same result just by disabling/enabling IPoIB.
>
> You could add a delay when the port first comes up and likely see things work
> properly. Any such delay should really be implemented in IBAL or in the HCA
> driver, though ideally the SM would synchronize with the SA earlier.
>
> -Fab
> _______________________________________________
> ofw mailing list
> [email protected]
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
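
For reference, the retry budget discussed in this thread can be put in
numbers. What follows is only a sketch in Python, not IPoIB or IBAL code:
backoff_schedule and query_with_retry are hypothetical names, send_query
stands in for whatever actually issues the SA query, and the backoff factor
and cap are assumptions. The only values taken from the thread are the
defaults of 1 second timeouts and 10 retries.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def backoff_schedule(first_timeout: float, retries: int,
                     factor: float = 1.0,
                     max_timeout: Optional[float] = None) -> Iterator[float]:
    """Yield the timeout (in seconds) to use for each attempt.

    factor=1.0 gives fixed timeouts (the default described in the thread);
    factor>1.0 gives exponential backoff, optionally capped at max_timeout.
    """
    timeout = first_timeout
    for _ in range(retries):
        yield timeout if max_timeout is None else min(timeout, max_timeout)
        timeout *= factor


def query_with_retry(send_query: Callable[[float], Optional[object]],
                     timeouts: Iterable[float]
                     ) -> Tuple[Optional[object], float]:
    """Call send_query(timeout) until it returns a non-None result.

    send_query is a hypothetical callable that blocks for up to `timeout`
    seconds and returns None if no response arrived. Returns the result
    (or None if every attempt timed out) and the seconds spent on the
    attempts that timed out.
    """
    waited = 0.0
    for timeout in timeouts:
        result = send_query(timeout)
        if result is not None:
            return result, waited
        waited += timeout
    return None, waited


if __name__ == "__main__":
    # Worst-case wait if the SA never answers, for two schedules.
    fixed = sum(backoff_schedule(1.0, 10))
    backed_off = sum(backoff_schedule(1.0, 10, factor=2.0, max_timeout=8.0))
    print("fixed schedule, worst case (s):", fixed)
    print("backoff schedule, worst case (s):", backed_off)
```

With the defaults from the thread the worst case is 10 seconds. Doubling the
timeout with an 8 second cap stretches the same 10 attempts to 63 seconds,
which is exactly the extra time Fab points out the OS will not grant while
ARP resolution is pending.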
