In message <[email protected]>, Nicolas Morey Chaisemartin writes:
>We've noticed while setting up a new cluster a problem with OpenSM.
>As usual, there are some cable problems while plugging the cluster but one of
>the cable was changing state over 10 000 thousands times per second (OFF/ON)
>and sending each time a 128 trap to OpenSM.
>Therefore, OpenSM is constantly rerouting the whole interconnect (every 10s or
>so).
>Fixing the cable will solve our problem, but I still think something should be
>done about this.

we have seen the same problem here locally. it seems to be a violation
of the spec to send this many traps per second.

>I was thinking about a solution:
>When receiving a 128 trap (and it triggers a heavy sweep) we check the faulty
>GUID, lid or port guid.
>If last heavy sweep was triggered by the same faulty port, we wait twice last
>the last waiting time before forcing the new heavy sweep.
>If it's another source or another reason, we force the heavy sweep right then
>and set the waiting time to 0.
>
>This way, different problem will still trigger a heavy sweep asap but if only
>one faulty links triggers it it'll sweep less and less often as it is pretty
>useless.
>
>It should solve this case but there may still be a problem when more ports
>have the same problem...
>
>Any idea on a way to manage this?
>An ignore mask on traps? (ignore traps for 1 specific problem for x seconds if
>they happen to often)

our solution was a custom patch (that might have made it into the
opensm distribution) called 'babbling_port_policy'. it attempted to
disable the port in question.
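
for reference, the per-port backoff proposed in the quoted message could be sketched roughly as below. this is not OpenSM code and not the babbling_port_policy patch; the class name, on_trap(), base_wait and max_wait are all made up for illustration. two assumptions go beyond the proposal: since doubling a waiting time of 0 stays 0, the sketch starts repeat offenders at base_wait, and it caps the wait at max_wait. traps that arrive inside the waiting window are simply dropped rather than deferred, which is a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SweepBackoff:
    """Hypothetical per-source backoff for trap-triggered heavy sweeps."""

    base_wait: float = 1.0    # assumption: first wait for a repeat offender (seconds)
    max_wait: float = 300.0   # assumption: upper bound on the wait (seconds)
    last_source: Optional[int] = None
    wait: float = 0.0
    last_sweep: float = float("-inf")

    def on_trap(self, source: int, now: float) -> bool:
        """Return True if a heavy sweep should be started now."""
        if source != self.last_source:
            # Different port (or first trap): sweep right away, reset the wait to 0.
            self.last_source = source
            self.wait = 0.0
            self.last_sweep = now
            return True
        if now - self.last_sweep < self.wait:
            # Same port, still inside its waiting window: skip this trap.
            return False
        # Same port, window elapsed: sweep, then double the wait for next time.
        self.wait = min(self.max_wait, max(self.base_wait, self.wait * 2))
        self.last_sweep = now
        return True


if __name__ == "__main__":
    backoff = SweepBackoff()
    flapping_port = 0xA  # placeholder identifier, e.g. a port GUID
    # One port sending a trap every 100 ms for 10 seconds.
    sweeps = [t / 10 for t in range(100) if backoff.on_trap(flapping_port, t / 10)]
    print("sweeps at:", sweeps)
```

with one port flapping, the sweeps spread out as the wait doubles, while a trap from any other port still sweeps immediately. as the quoted message already notes, this does not help when several ports babble at once, since each change of source resets the wait.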
