On Sat, 2008-04-05 at 01:45 +0200, Bernd Schubert wrote:
> On Fri, Apr 04, 2008 at 03:29:32PM -0700, Ira Weiny wrote:
> > On Sat, 5 Apr 2008 00:12:39 +0200
> > Bernd Schubert <[EMAIL PROTECTED]> wrote:
> >
> > > Hello,
> > >
> > > after I upgraded one of our clusters to opensm-3.2.1 it seems to have
> > > gotten much better there, at least no further RcvSwRelayErrors, even
> > > when the cluster is in idle state, and so far also no SymbolErrors,
> > > which we have also seen before.
> > >
> > > However, after I just started a lustre stress test on 50 clients (to a
> > > lustre storage system with 20 OSS servers and 60 OSTs), ibcheckerrors
> > > reports about 9000 XmtDiscards within 30 minutes.
> >
> > Yea, those are bad.
> >
> > > Searching for this error I find "This is a symptom of congestion and may
> > > require tweaking either HOQ or switch lifetime values".
> > > Well, I have to admit I neither know what HOQ is, nor do I know how to
> > > tweak it. I also have no idea how to set switch lifetime values. I
> > > guess this isn't related to the opensm timeout option, is it?
> >
> > Yes you should adjust these values.
> >
> > > Hmm, I just found a Cisco pdf describing how to set the lifetime on these
> > > switches, but is this also possible on Flextronics switches?
> >
> > I don't know about the Vendor SMs but in opensm look for the following
> > options in the opensm.opts file (Default path is: /var/cache/opensm):
> >
> > # The code of maximal time a packet can wait at the head of
> > # transmission queue.
> > # The actual time is 4.096usec * 2^<head_of_queue_lifetime>
> > # The value 0x14 disables this mechanism
> > head_of_queue_lifetime 0x12
> >
> > # The maximal time a packet can wait at the head of queue on
> > # switch port connected to a CA or router port
> > leaf_head_of_queue_lifetime 0x0c
>
> Hmm, I first increased head_of_queue_lifetime to 0x13 and
> leaf_head_of_queue_lifetime to 0x20, but this didn't make the error
> go away. So I increased head_of_queue_lifetime to 0x15 and
> leaf_head_of_queue_lifetime to 0x50, but this made the fabric crash
> entirely. On the node of the master opensm I got an endless number of
> messages like these:
>
> Apr 5 01:35:03 pfs1n2 kernel: [705448.344542] NETDEV WATCHDOG: ib0: transmit timed out
> Apr 5 01:35:03 pfs1n2 kernel: [705448.349814] ib0: transmit timeout: latency 411908 msecs
> Apr 5 01:35:03 pfs1n2 kernel: [705448.355364] ib0: queue stopped 1, tx_head 441, tx_tail 377
> Apr 5 01:35:04 pfs1n2 kernel: [705449.343495] NETDEV WATCHDOG: ib0: transmit timed out
>
> The slave opensm also went into D-state and is not killable anymore :(
>
> Seems I have to be very careful with these settings...
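For a sense of scale, the formula in the quoted opensm.opts comment can be evaluated directly. The following is a minimal Python sketch, not part of opensm: the function name hoq_lifetime_seconds and the constants are made up for illustration, and it only applies the formula as quoted. It says nothing about how opensm or the switches treat codes above 0x14, which this thread does not state.

```python
# Illustrative only: evaluates the formula from the quoted opensm.opts comment,
#   actual time = 4.096 usec * 2^<head_of_queue_lifetime>
# Names below are hypothetical helpers, not opensm APIs.

TICK_USEC = 4.096      # base unit taken from the quoted comment
DISABLE_CODE = 0x14    # per the quoted comment, this value disables the mechanism


def hoq_lifetime_seconds(code: int) -> float:
    """Return the head-of-queue lifetime in seconds for a given code."""
    return TICK_USEC * (2 ** code) / 1_000_000


if __name__ == "__main__":
    settings = [
        ("head_of_queue_lifetime (default)", 0x12),
        ("leaf_head_of_queue_lifetime (default)", 0x0C),
        ("head_of_queue_lifetime (first try)", 0x13),
    ]
    for name, code in settings:
        print(f"{name}: {code:#04x} -> {hoq_lifetime_seconds(code):.6f} s")

    # Values tried in the thread that are at or beyond the disable code.
    for code in (0x15, 0x20, 0x50):
        relation = "equals" if code == DISABLE_CODE else "exceeds"
        print(f"{code:#04x} {relation} the disable code {DISABLE_CODE:#04x}")
```

By this formula the defaults 0x12 and 0x0c work out to roughly 1.07 s and 16.8 ms, and 0x13 to roughly 2.15 s. The values 0x15, 0x20 and 0x50 are all above 0x14, the value the comment describes as disabling the mechanism.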
Yes, those settings are not for the faint of heart, and one needs to
understand what changing those parameters actually does before touching them.

As for the slave opensm behavior, this is worth understanding better, IMO.

-- Hal

> Thanks for your help,
> Bernd

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
