Title: OpenSM work

Hi,

 

I have just uploaded a new change set (2255) with OpenSM improvements to OpenIB:
https://openib.org/svn/gen1/trunk/src/userspace/osm

The work was done according to the plan outlined in the attached mail.

 

The main changes are:

  1. Semi-static LID assignment - prevents LID changes due to SM restarts or node reboots. This is important because of the bad effect LID changes have on path record caching in IPoIB (and other ULPs).
  2. Irresponsive port scanning during light sweep - allows the SM to recognize ports that did not respond in the first sweep (but had a link state other than down).
  3. Lower HOQ value on switch ports that connect to HCA ports - this provides a faster "drain" for packets waiting on a bad HCA, so that the entire fabric is not affected by a single HCA.

 

I will outline our work plan for next month in my next mail, for your comments.

 

Eitan Zahavi

Design Technology Director

Mellanox Technologies LTD

Tel:+972-4-9097208
Fax:+972-4-9593245

P.O. Box 586 Yokneam 20692 ISRAEL

 

-----Original Message-----
From: Eitan Zahavi [mailto:[EMAIL PROTECTED]
Sent: Friday, April 08, 2005 7:40 PM
To: '[email protected]'
Subject: [openib-general] OpenSM work

 

Hi All,

FYI: Mellanox has been focusing on the following OpenSM development items for the last few weeks:

1.      Stability testing over the IB management simulator:
a.      Randomly pick bad links with high packet drop statistics - success is SUBNET UP
b.      Route using up/down algorithm - success is no credit loops

2.      Semi-static LID assignment:
a.      Developed an interface for persistent storage of arbitrary data. The goal is to enable further development of an LDAP (a la Troy's request) or SQL module. Please see the attached osm_db.h.

  <<osm_db.h>>

b.      Developed a file-based implementation of osm_db.h.
c.      Modified osm_lid_mgr (the LID assignment algorithm) to use the LIDs stored in the persistent storage. It handles all cases of a bad file and of new LIDs on the fabric. The -r flag now lets OpenSM overwrite the known data. Persistent GUID-to-LID data is kept even if the GUID disappears for a while. The code also handles LID assignment for LMC > 0 better than the previous algorithm, which assigned 2^LMC LIDs to every port, even switch port 0. Now it reserves only 1 LID for switch port 0.
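
To make the assignment rules in item 2 concrete, here is a minimal Python sketch. It is only a model of the behaviour described above, not the osm_lid_mgr code: the function name, the in-memory dict standing in for the persistent storage, and the (min LID, max LID) record layout are all assumptions made for illustration.

```python
FIRST_UNICAST_LID = 0x0001
LAST_UNICAST_LID = 0xBFFF


def assign_lids(ports, stored, lmc=0):
    """Assign LID ranges, reusing stored ranges where possible.

    ports  -- iterable of (guid, is_switch_port0) pairs seen in this sweep
    stored -- dict guid -> (min_lid, max_lid), a stand-in for the persistent
              storage (hypothetical layout); updated in place
    Returns a dict guid -> (min_lid, max_lid) for the ports that were seen.
    """
    used = set()
    # Reserve every valid stored range first, so a GUID that disappeared for
    # a while gets the same LIDs back when it returns.
    for guid, (lo, hi) in list(stored.items()):
        valid = FIRST_UNICAST_LID <= lo <= hi <= LAST_UNICAST_LID
        lids = set(range(lo, hi + 1)) if valid else set()
        if not valid or lids & used:
            del stored[guid]  # bad or conflicting entry: drop it
        else:
            used |= lids

    assigned = {}
    for guid, is_switch_port0 in ports:
        # Switch port 0 gets a single LID; every other port gets 2^LMC LIDs.
        count = 1 if is_switch_port0 else 1 << lmc
        entry = stored.get(guid)
        if entry is not None:
            lo, hi = entry
            if hi - lo + 1 == count and lo % count == 0:
                assigned[guid] = entry  # stored range still fits: keep it
                continue
            used -= set(range(lo, hi + 1))  # stale range (e.g. LMC changed)
        # Take the lowest free range whose base is aligned to its size.
        lo = -(-FIRST_UNICAST_LID // count) * count
        while any(lid in used for lid in range(lo, lo + count)):
            lo += count
        if lo + count - 1 > LAST_UNICAST_LID:
            raise RuntimeError("out of unicast LIDs")
        used |= set(range(lo, lo + count))
        stored[guid] = assigned[guid] = (lo, lo + count - 1)
    return assigned
```

With LMC = 2, a stored entry for one GUID is reused as is, a switch port 0 takes a single LID, and a new port takes the next free aligned block of four LIDs. Entries for GUIDs that are not seen in a sweep stay in the store untouched.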

3.      Irresponsive port:
a.      The phenomenon is: a port does not respond to the SM during the discovery stage. OpenSM cannot obtain enough data about the port, and thus it does not appear in the final database. Since OpenSM uses light sweeps when there is no "change detected", it will not query the port until a switch either sets its "change bit" or sends a trap. So that irresponsive port will never be polled again unless there is a heavy sweep.

b.      The solution:
i.      During discovery track ports (physical ports) that have their logical link state != DOWN but the port on the other side of the link is not known to the SM.

ii.     During light sweep: not only scan the switches' "change bit" but also test whether the port on the other side of each of these ports (from i) is responding. If it is, issue a heavy sweep.
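
The two steps of the solution in item 3 can be sketched as follows. This is a simplified Python model under stated assumptions, not OpenSM code: the PhysPort record, the function names, and the probe_peer callback (standing in for whatever query the SM sends to the far side of the link) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

DOWN = "DOWN"  # other logical link states: "INIT", "ARMED", "ACTIVE"


@dataclass(frozen=True)
class PhysPort:
    switch_guid: int
    port_num: int
    link_state: str
    peer_guid: Optional[int]  # None if the peer never answered in discovery


def track_unresolved(ports):
    """Step i: ports whose link is not DOWN but whose peer is unknown."""
    return [p for p in ports if p.link_state != DOWN and p.peer_guid is None]


def light_sweep_needs_heavy(change_bits, unresolved, probe_peer):
    """Step ii: decide whether the light sweep should trigger a heavy sweep.

    change_bits -- dict switch_guid -> bool, the switches' "change bit"
    unresolved  -- list returned by track_unresolved()
    probe_peer  -- callable(PhysPort) -> bool, True if the peer now responds
    """
    if any(change_bits.values()):
        return True
    # No change bit set: still re-test the peers that were silent before.
    return any(probe_peer(p) for p in unresolved)
```

The tracked list is expected to be short in a healthy fabric, so the extra probes should add little to the cost of a light sweep.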

4.      Head of Queue Life:
a.      Problem: In cases of PCI hardware failure, HCAs cannot complete RDMA requests and lose all credits on their input ports (in other words: their input buffers fill up). So they create back pressure on the fabric.

b.      Solution: use a fast head of queue time limit on every switch port that drives an HCA.
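
A rough Python sketch of the head-of-queue idea in item 4. It assumes the usual InfiniBand encoding of the 5-bit HOQLife field (lifetime = 4.096 usec * 2^HOQLife, with values above 19 meaning no limit). The function names and the two default values are illustrative assumptions only, not the values used in the change set.

```python
def hoq_life_seconds(value):
    """Head-of-queue lifetime in seconds for a 5-bit HOQLife value."""
    if not 0 <= value <= 31:
        raise ValueError("HOQLife is a 5-bit field")
    # Values above 19 are taken to mean "no limit" (assumed encoding).
    return float("inf") if value > 19 else 4.096e-6 * 2 ** value


def choose_hoq_life(peer_is_hca, fabric_hoq=18, hca_facing_hoq=12):
    """Pick a shorter lifetime for switch ports that drive an HCA.

    The two defaults are hypothetical example values.
    """
    return hca_facing_hoq if peer_is_hca else fabric_hoq
```

With these example values, a packet stuck at the head of a queue toward a bad HCA is dropped after roughly 17 msec instead of roughly 1 second, so the back pressure clears sooner on the HCA-facing port than on inter-switch links.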

5.      SA queries stress testing:
a.      We are exploring max performance of the SA and ways to improve it.

Eitan

 

