Hi,

While setting up a new cluster, we ran into a problem with OpenSM.
As usual, there were some cabling problems while wiring the cluster, but one 
cable was changing state (OFF/ON) more than 10,000 times per second and 
sending a trap 128 to OpenSM each time.
As a result, OpenSM was constantly rerouting the whole interconnect (every 10s 
or so).
Fixing the cable will solve our problem, but I still think something should be 
done about this.

Though OpenSM's behaviour was correct, it was really difficult to find where 
the performance problems came from.
All our diagnostic tools (mostly based on infiniband-diags) failed to reveal 
the problem.
infiniband-diags commands failed against the faulty port, but it was hard to 
say whether the port itself was faulty or whether the failures were due to 
high load on the SM and dropped VL15 messages.

I was thinking about a solution:
When we receive a trap 128 that triggers a heavy sweep, we check the GUID, LID 
or port GUID of the faulty port.
If the last heavy sweep was triggered by the same faulty port, we wait twice 
the previous waiting time before forcing the new heavy sweep.
If it comes from another source or has another cause, we force the heavy sweep 
right away and reset the waiting time to 0.
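
To make the idea concrete, here is a minimal sketch of the doubling rule in 
Python. It is not OpenSM code: the class name, the fields, the base_delay 
(needed because doubling 0 stays 0) and the max_delay cap are all assumptions 
made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SweepBackoff:
    """Tracks how long to wait before the next heavy sweep (hypothetical helper)."""
    base_delay: float = 1.0    # assumed first non-zero delay, in seconds
    max_delay: float = 600.0   # assumed cap so the delay cannot grow forever
    last_source: Optional[int] = None  # port GUID behind the previous heavy sweep
    delay: float = 0.0         # waiting time applied to the previous heavy sweep

    def on_trap(self, port_guid: int) -> float:
        """Return the number of seconds to wait before forcing the heavy sweep."""
        if port_guid == self.last_source:
            # Same faulty port again: double the previous wait.
            # Doubling 0 would stay 0, so start from base_delay.
            self.delay = min(max(self.delay * 2, self.base_delay), self.max_delay)
        else:
            # A different source: sweep immediately and reset the wait.
            self.last_source = port_guid
            self.delay = 0.0
        return self.delay
```

With the assumed defaults, repeated traps from one port give waits of 0, 1, 2, 
4, ... seconds, and a trap from any other port goes back to 0.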

This way, different problems will still trigger a heavy sweep ASAP, but if a 
single faulty link keeps triggering it, OpenSM will sweep less and less often, 
since those sweeps are pretty useless.

This should solve our case, but there may still be a problem when several 
ports misbehave in the same way...

Any ideas on how to manage this?
An ignore mask on traps? (ignore traps for one specific problem for x seconds 
if they happen too often)
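
As a rough illustration of what such an ignore mask could look like, here is a 
small Python sketch. Again, this is not OpenSM code: the class, the method name 
and the thresholds (max_traps, window, ignore_for) are hypothetical, and the 
caller passes the current time explicitly.

```python
from collections import defaultdict, deque


class TrapIgnoreMask:
    """Per-source trap filter (hypothetical; names and thresholds are assumptions)."""

    def __init__(self, max_traps=5, window=1.0, ignore_for=10.0):
        self.max_traps = max_traps    # traps tolerated per window before muting
        self.window = window          # seconds over which traps are counted
        self.ignore_for = ignore_for  # the "x seconds" during which traps are ignored
        self._recent = defaultdict(deque)  # (port_guid, trap_num) -> recent timestamps
        self._muted_until = {}             # (port_guid, trap_num) -> end of mute period

    def should_handle(self, port_guid, trap_num, now):
        """Return True if the trap should be acted on, False if it is ignored."""
        key = (port_guid, trap_num)
        if now < self._muted_until.get(key, 0.0):
            return False
        recent = self._recent[key]
        # Drop timestamps that have fallen out of the counting window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.max_traps:
            # Too many traps from this source: mute it for ignore_for seconds.
            self._muted_until[key] = now + self.ignore_for
            recent.clear()
            return False
        return True
```

Because the state is kept per (port, trap number), each flapping port would be 
muted on its own while traps from healthy ports are still handled right away.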

Thanks

Nicolas