On Thu, Mar 26, 2009 at 8:37 AM, Nicolas Morey Chaisemartin
<[email protected]> wrote:
> Hi,
>
> We've noticed while setting up a new cluster a problem with OpenSM.
> As usual, there were some cable problems while plugging in the cluster, but one 
> of the cables was changing state (OFF/ON) over 10 000 times per second and 
> sending a trap 128 to OpenSM each time.

That's a violation of the IBTA spec by those SMAs. The spec sets a
maximum rate at which traps may be generated, based on the response
time, and that limit applies even when the actual event rate exceeds
it, which sounds like the case here. This issue has been known for
several years now but is still unfixed :-(
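
To make the idea of a sender-side limit concrete, here is a minimal
sketch in Python of a throttle that lets at most one trap through per
interval, no matter how fast the underlying events occur. It is only an
illustration: the class name `TrapThrottle` and the parameter
`min_interval_us` are hypothetical, and the interval value used below is
a placeholder, not the value the spec derives.

```python
class TrapThrottle:
    """Allow at most one trap per min_interval_us, however many events occur."""

    def __init__(self, min_interval_us):
        self.min_interval_us = min_interval_us  # assumed limit, placeholder value
        self.last_sent_us = None                # time of the last trap sent
        self.suppressed = 0                     # events that produced no trap

    def on_event(self, now_us):
        """Return True if a trap may be sent for the event at time now_us."""
        if (self.last_sent_us is None
                or now_us - self.last_sent_us >= self.min_interval_us):
            self.last_sent_us = now_us
            return True
        self.suppressed += 1
        return False


# 10 000 link state changes spread over one second, one every 100 us.
throttle = TrapThrottle(min_interval_us=500_000)
sent = [throttle.on_event(i * 100) for i in range(10_000)]
```

With these placeholder numbers only two traps leave the port during that
second; the other events are counted but generate no traffic toward the SM.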

> Therefore, OpenSM is constantly rerouting the whole interconnect (every 10s 
> or so).

Even if the trap rate issue (which chews up a lot of SM CPU) is
resolved, that won't do anything about the repeated rerouting.

> Fixing the cable will solve our problem, but I still think something should 
> be done about this.
>
> Though OpenSM's behaviour was OK, it was really difficult to find where the 
> performance problems came from.

There should be some log messages about the trap rate being exceeded.
Were they not present? Which OpenSM version?

> All our diagnostic tools (mostly using infiniband-diags) were failing to see 
> the problem.
> infiniband-diags commands failed toward the faulty port, but it was hard to 
> say if the port was faulty or if it was due to high load on the SM and dropped 
> VL15 messages.

Yes, the only thing you would observe is VL15 drops via perfquery. The
SM is the one that should be logging the trap originator, and that log
is the way to diagnose this issue.
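
As an illustration of what identifying the trap originator amounts to,
the following Python sketch counts traps per source LID over a sliding
window and flags a source once it exceeds a threshold. It is not how
OpenSM implements its logging; `TrapCounter`, `window` and `threshold`
are hypothetical names with assumed values.

```python
from collections import defaultdict, deque


class TrapCounter:
    """Track trap arrival times per source LID over a sliding window."""

    def __init__(self, window=1.0, threshold=100):
        self.window = window        # seconds of history to keep (assumed value)
        self.threshold = threshold  # traps per window treated as excessive (assumed)
        self.times = defaultdict(deque)

    def record(self, lid, now):
        """Record one trap; return True if this source now exceeds the threshold."""
        q = self.times[lid]
        q.append(now)
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.threshold


counter = TrapCounter(window=1.0, threshold=3)
flags = [counter.record(lid=7, now=t) for t in (0.0, 0.1, 0.2, 0.3, 2.0)]
```

Here the fourth trap from LID 7 within one second is flagged, and the
trap arriving at 2.0 s is not, because the earlier ones have aged out.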

> I was thinking about a solution:
> When a trap 128 is received (and it triggers a heavy sweep), we check the 
> faulty GUID, LID or port GUID.
> If the last heavy sweep was triggered by the same faulty port, we wait twice 
> the last waiting time before forcing the new heavy sweep.
> If it's another source or another reason, we force the heavy sweep right then 
> and set the waiting time to 0.
>
> This way, different problems will still trigger a heavy sweep asap, but if only 
> one faulty link triggers it, it'll sweep less and less often, as it is pretty 
> useless.
>
> It should solve this case but there may still be a problem when more ports 
> have the same problem...
>
> Any idea on a way to manage this?
> An ignore mask on traps? (ignore traps for one specific problem for x seconds 
> if they happen too often)
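
The doubling scheme proposed above can be sketched as follows. This is a
minimal illustration in Python, not OpenSM code: the class name
`SweepBackoff`, the parameters `initial_wait` and `max_wait`, and the use
of a port GUID string as the key are all assumptions. Two details are
additions to the proposal: the first repeat from the same source waits
`initial_wait` seconds (doubling a wait of 0 would stay at 0), and the
wait is capped at `max_wait`.

```python
class SweepBackoff:
    """Decide how long to delay a heavy sweep requested by a trap source."""

    def __init__(self, initial_wait=1.0, max_wait=300.0):
        self.initial_wait = initial_wait  # hypothetical first delay for a repeat source
        self.max_wait = max_wait          # hypothetical cap, not part of the proposal
        self.last_source = None           # source that triggered the previous sweep
        self.wait = 0.0                   # delay applied to the previous sweep

    def next_wait(self, source):
        """Return the delay in seconds before the heavy sweep for this trigger."""
        if source is not None and source == self.last_source:
            # Same faulty port again: double the previous wait, up to the cap.
            doubled = self.wait * 2 if self.wait > 0 else self.initial_wait
            self.wait = min(self.max_wait, doubled)
        else:
            # Another source or another reason: sweep right away.
            self.wait = 0.0
        self.last_source = source
        return self.wait


backoff = SweepBackoff(initial_wait=1.0, max_wait=8.0)
triggers = ("guid_a", "guid_a", "guid_a", "guid_a", "guid_b", "guid_a")
waits = [backoff.next_wait(g) for g in triggers]
```

In this run the waits are 0, 1, 2 and 4 seconds for the repeated source,
then 0 again as soon as a different source appears. That reset also shows
the weakness already noted in the proposal: two faulty ports alternating
would keep the wait at 0, so handling that case would need per-source
state rather than a single "last source".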

I need to think about this more before commenting.

-- Hal

> Thanks
>
> Nicolas
> _______________________________________________
> general mailing list
> [email protected]
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>