Nicolas Morey Chaisemartin wrote:
> Hi,
>
> We've noticed a problem with OpenSM while setting up a new cluster.
> As usual, there are some cable problems while plugging the cluster but one of
> the cables was changing state over 10 000 thousands times per second (OFF/ON)
> and sending each time a 128 trap to OpenSM.
> Therefore, OpenSM is constantly rerouting the whole interconnect (every 10s
> or so).
> Fixing the cable will solve our problem, but I still think something should
> be done about this.
>
> Though OpenSM behaviour was OK, it was really difficult to find where the
> performance problems came from.
> All our diagnostic tools (mostly using infiniband diags) were failing to see
> the problem.
> Infiniband diags commands fail toward the faulty port but it was hard to say
> if the port was faulty or if it was due to high load on the SM and dropped VL15
> messages.
>
> I was thinking about a solution:
> When receiving a 128 trap (and it triggers a heavy sweep) we check the faulty
> GUID, lid or port guid.
> If the last heavy sweep was triggered by the same faulty port, we wait twice
> the last waiting time before forcing the new heavy sweep.
> If it's another source or another reason, we force the heavy sweep right then
> and set the waiting time to 0.

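For reference, the back-off proposed above could be sketched roughly as follows. This is only an illustration, not OpenSM code: the SweepBackoff class, the "source" key (whatever identifier the SM can attribute the trap to, for example the LID of the node that issued it), the 1 second starting delay and the 600 second cap are all assumptions. A non-zero starting delay is needed because doubling a waiting time of 0 would stay at 0.

```python
from dataclasses import dataclass
from typing import Hashable, Optional


@dataclass
class SweepBackoff:
    """Hypothetical helper: how long to wait before the next heavy sweep."""

    base_delay: float = 1.0   # assumed first non-zero wait, in seconds
    max_delay: float = 600.0  # assumed upper bound on the wait, in seconds
    last_source: Optional[Hashable] = None
    last_delay: float = 0.0

    def delay_for(self, source: Hashable) -> float:
        """Return the wait (seconds) before sweeping for a trap from `source`."""
        if self.last_source is not None and source == self.last_source:
            # Same source as the previous sweep: double the previous wait,
            # starting from base_delay and never exceeding max_delay.
            doubled = max(self.base_delay, 2 * self.last_delay)
            self.last_delay = min(doubled, self.max_delay)
        else:
            # Different source (or first trap): sweep right away, reset wait.
            self.last_delay = 0.0
        self.last_source = source
        return self.last_delay


if __name__ == "__main__":
    backoff = SweepBackoff()
    for src in ["sw-1", "sw-1", "sw-1", "sw-1", "sw-2", "sw-1"]:
        print(src, backoff.delay_for(src))
```

With the defaults, repeated traps from "sw-1" give waits of 0, 1, 2 and 4 seconds, and a trap from any other source resets the wait to 0. As noted in the original mail, this does not help when several sources flap at once, since each change of source resets the wait.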
Note that trap 128 is generated by a switch to report that the state of one of its ports has changed. The GUID/LID of the changed port is not reported in the trap.

You can set the sweep_on_trap option in opensm.conf to FALSE. This should stop OpenSM from starting a heavy sweep on every trap.

> This way, different problems will still trigger a heavy sweep asap but if only
> one faulty link triggers it, it'll sweep less and less often as it is pretty
> useless.
>
> It should solve this case but there may still be a problem when more ports
> have the same problem...
>
> Any idea on a way to manage this?
> An ignore mask on traps? (ignore traps for 1 specific problem for x seconds
> if they happen too often)
>
> Thanks
>
> Nicolas
> _______________________________________________
> general mailing list
> [email protected]
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
