Nicolas Morey Chaisemartin wrote:
> Hi,
> 
> While setting up a new cluster, we noticed a problem with OpenSM.
> As usual, there were some cable problems while cabling the cluster, but one
> of the cables was changing state (OFF/ON) over 10,000 times per second and
> sending a trap 128 to OpenSM each time.
> As a result, OpenSM is constantly rerouting the whole interconnect (every 10s
> or so).
> Fixing the cable will solve our problem, but I still think something should 
> be done about this.
> 
> Though OpenSM's behaviour was OK, it was really difficult to find where the
> performance problems came from.
> All our diagnostic tools (mostly using infiniband diags) failed to see
> the problem.
> Infiniband diags commands failed toward the faulty port, but it was hard to
> say whether the port was faulty or whether it was due to high load on the SM
> and dropped VL15 messages.
> 
> I was thinking about a solution:
> When we receive a trap 128 (and it triggers a heavy sweep), we check the
> faulty GUID, LID or port GUID.
> If the last heavy sweep was triggered by the same faulty port, we wait twice
> the last waiting time before forcing the new heavy sweep.
> If it's another source or another reason, we force the heavy sweep right away
> and set the waiting time to 0.
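
The doubling wait proposed above could be sketched roughly as follows. This is
only an illustration in Python, not OpenSM code: the class name, the `source`
key (whatever identifies the origin of the trigger), and the `initial_delay`
and `max_delay` values are all assumptions. A non-zero first delay is needed
because doubling a waiting time of 0 would never grow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SweepBackoff:
    """Doubling delay for heavy sweeps repeatedly triggered by one source."""

    initial_delay: float = 1.0   # hypothetical first non-zero wait, in seconds
    max_delay: float = 300.0     # hypothetical cap so the wait cannot grow forever
    last_source: Optional[str] = None
    delay: float = 0.0

    def on_trigger(self, source: str) -> float:
        """Return how long to wait before the heavy sweep for this trigger."""
        if source != self.last_source:
            # Another source or another reason: sweep right away, reset the wait.
            self.last_source = source
            self.delay = 0.0
        elif self.delay == 0.0:
            # First repeat from the same source: start waiting.
            self.delay = self.initial_delay
        else:
            # Further repeats from the same source: wait twice as long.
            self.delay = min(self.delay * 2, self.max_delay)
        return self.delay
```

With these defaults, three triggers in a row from the same source would wait 0,
1 and 2 seconds, and a trigger from a different source would reset the wait to 0.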

Note that trap 128 is generated by a switch to report that one of its
ports has changed state.
The GUID/LID of the changed port is not reported in the trap.

You can set the sweep_on_trap option in opensm.conf to FALSE.
This should stop OpenSM from starting a heavy sweep on every trap.
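
As a sketch, and assuming the usual "name value" layout of opensm.conf, the
relevant line would look something like this (the comment line is mine, not
taken from the shipped file):

```
# Do not start a heavy sweep on each received trap
sweep_on_trap FALSE
```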

> 
> This way, a different problem will still trigger a heavy sweep ASAP, but if
> only one faulty link triggers it, it will sweep less and less often, since
> those sweeps are pretty useless.
> 
> It should solve this case but there may still be a problem when more ports 
> have the same problem...
> 
> Any idea on a way to manage this?
> An ignore mask on traps? (ignore traps for one specific problem for x seconds
> if they happen too often)
> 
> Thanks
> 
> Nicolas
> _______________________________________________
> general mailing list
> [email protected]
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
