Hi Bernd,

On Sat, 2008-04-05 at 00:12 +0200, Bernd Schubert wrote:
> Hello,
>
> after I upgraded one of our clusters to opensm-3.2.1 it seems to have gotten
> much better there, at least no further RcvSwRelayErrors, even when the
> cluster is in idle state and so far also no SymbolErrors, which we also have
> seen before.
>
> However, after I just started a lustre stress test on 50 clients (to a lustre
> storage system with 20 OSS servers and 60 OSTs), ibcheckerrors reports about
> 9000 XmtDiscards within 30 minutes.
>
> Searching for this error I find "This is a symptom of congestion and may
> require tweaking either HOQ or switch lifetime values".
> Well, I have to admit I neither know what HOQ is, nor do I know how to tweak
> it. I also have no idea how to set switch lifetime values. I guess this
> isn't related to the opensm timeout option, is it?
>
> Hmm, I just found a Cisco PDF describing how to set the lifetime on these
> switches, but is this also possible on Flextronics switches?

What routing algorithm are you using? Rather than play with those switch
values, if you are not using up/down, could you try that to see if it helps
with the congestion you are seeing?

-- Hal

> Thanks for any help,
> Bernd
