Hi Fab,

On Wed, Sep 17, 2008 at 4:28 PM, Fab Tillier
<[EMAIL PROTECTED]> wrote:
>>Hi,
>>
>>We recently found that on several systems, with different OSes and
>>different HCAs, IPoIB is not able to establish a connection due to a
>>timeout. Once the HCA was disabled and enabled (in Device Manager) the
>>problem was gone. We have a very busy InfiniBand network: many nodes
>>connected and tests running 24x7, but this is nothing compared to
>>client's network.
>>I think this situation requires better handling; a message in the system
>>log (see below) is not enough. Maybe something repetitive that sends this
>>query every few seconds as long as the connection is not established when
>>it should be. Any thoughts?
>
> IPoIB allows 10 seconds (1 second timeouts, 10 retries) by default to hear 
> back from the SM.  Even if you get past this issue, you will likely run into 
> the same timeouts when querying for paths to respond to ARP requests.  While 
> you may be able to do something internally to IPoIB or IBAL to exponentially
> back off for these queries, the OS will not give you more time to get a
> response from the SM, and the ARP resolution will time out.
>
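
To make the arithmetic above concrete, here is a minimal Python sketch. It is not IPoIB or IBAL code; the function names and the doubling factor are assumptions chosen for illustration. It compares the default fixed schedule (ten 1-second timeouts) with an exponential backoff squeezed into the same 10-second budget. Backoff sends fewer queries to the SM, but it does not buy any extra time.

```python
def fixed_schedule(timeout_s, retries):
    """Per-attempt timeouts for a fixed-interval policy."""
    return [timeout_s] * retries


def backoff_schedule(initial_s, factor, budget_s):
    """Per-attempt timeouts that grow by `factor`, truncated to fit budget_s.

    Hypothetical helper: the last attempt is shortened so that the total
    never exceeds the overall budget.
    """
    if initial_s <= 0 or factor < 1:
        raise ValueError("initial_s must be > 0 and factor must be >= 1")
    schedule = []
    remaining = budget_s
    timeout = initial_s
    while remaining > 0:
        step = min(timeout, remaining)  # never wait past the budget
        schedule.append(step)
        remaining -= step
        timeout *= factor
    return schedule


if __name__ == "__main__":
    fixed = fixed_schedule(1.0, 10)
    backoff = backoff_schedule(1.0, 2.0, 10.0)
    print(len(fixed), "queries,", sum(fixed), "seconds total")
    print(len(backoff), "queries,", sum(backoff), "seconds total")
```

With these inputs the fixed policy issues 10 queries and the backoff policy 4 (1, 2, 4, then the 3 seconds that are left), both totalling 10 seconds.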
> In my experience, this issue is related to the SA not being in sync with the 
> topology recently discovered by the SM.

How did you determine this?

Which SM? If OpenSM, which version? Is it recent?

At least in terms of OpenSM, I'm not sure what you mean by in sync
with the recently discovered topology, as the SA and SM share the same data.

> What happens is that IPoIB will issue the port info query as soon as the IB
> port is up (SM moved port to active state), but the SA doesn't have a record 
> for the port yet.  The SM should update the SA's topology before bringing the 
> ports active for things to work properly.

When you say the port is up, do you mean PhysicalPortState or PortState?

At least for OpenSM, once the port is discovered by the SM, it would
be reported in an SA Get or GetTable PortInfoRecord. There is a window
between when the PhysicalPortState becomes LinkUp and when the SM
discovers the port.
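
The repeated query suggested in the original report could look roughly like the sketch below. It is Python for illustration only, not driver code. `query` is a hypothetical callable standing in for whatever issues the SA PortInfoRecord query, assumed to return None on a timeout or when no record exists yet. The 2-second interval and 60-second limit are made-up values.

```python
import time


def wait_for_sa_record(query, interval_s=2.0, max_wait_s=60.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call query() until it returns a record or max_wait_s would be exceeded.

    Returns (record, attempts); record is None if the SA never answered.
    `sleep` and `clock` are injectable so the loop can be tested without
    real delays.
    """
    deadline = clock() + max_wait_s
    attempts = 0
    while True:
        attempts += 1
        record = query()
        if record is not None:
            return record, attempts
        # Give up if another full interval would run past the deadline.
        if clock() + interval_s > deadline:
            return None, attempts
        sleep(interval_s)
```

A loop like this only covers the window before the SM has discovered the port. It does not change how long the OS waits for ARP resolution.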

-- Hal

> The reason disable/enable solves the issue is that by the time IPoIB is 
> enabled again, the SA's topology matches the SM's (there's more of a delay 
> with IPoIB being reported and the SM simultaneously bringing the HCA's port 
> up).  You can get the same result just by disabling/enabling IPoIB.
>
> You could add a delay when the port first comes up and likely see things work 
> properly.  Any such delay should really be implemented in IBAL or in the HCA 
> driver, though ideally the SM would synchronize with the SA earlier.
>
> -Fab
> _______________________________________________
> ofw mailing list
> [email protected]
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw
>
