Hello,

after I upgraded one of our clusters to opensm-3.2.1, things seem to have gotten much better there: at least there are no further RcvSwRelayErrors, even when the cluster is idle, and so far also no SymbolErrors, which we had also seen before.
However, after I started a Lustre stress test on 50 clients (against a Lustre storage system with 20 OSS servers and 60 OSTs), ibcheckerrors reported about 9000 XmtDiscards within 30 minutes. Searching for this error, I find: "This is a symptom of congestion and may require tweaking either HOQ or switch lifetime values". I have to admit that I neither know what HOQ is nor how to tweak it. I also have no idea how to set switch lifetime values. I guess this isn't related to the opensm timeout option, is it?

Hmm, I just found a Cisco PDF describing how to set the lifetime on Cisco switches, but is this also possible on Flextronics switches?

Thanks for any help,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
