On Fri, 2008-04-04 at 17:48 -0700, Boris Shpolyansky wrote:
> Bernd,
>
> 0x14 is the maximal value for HOQ lifetime, which effectively disables
> the mechanism. I think you shouldn't exceed this value.

True about the maximal value but any 5 bit value > 19 (up through 31)
should effectively be the same thing according to the spec. I also
think that OpenSM could do a better job validating and setting this
and other similar optional parameters.

-- Hal

> Boris
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Bernd
> Schubert
> Sent: Friday, April 04, 2008 4:46 PM
> To: Ira Weiny
> Cc: [email protected]
> Subject: Re: [ofa-general] XmtDiscards
>
> On Fri, Apr 04, 2008 at 03:29:32PM -0700, Ira Weiny wrote:
> > On Sat, 5 Apr 2008 00:12:39 +0200
> > Bernd Schubert <[EMAIL PROTECTED]> wrote:
> >
> > > Hello,
> > >
> > > after I upgraded one of our clusters to opensm-3.2.1 it seems to
> > > have gotten much better there, at least no further RcvSwRelayErrors,
> > > even when the cluster is in idle state and so far also no
> > > SymbolErrors, which we also have seen before.
> > >
> > > However, after I just started a lustre stress test on 50 clients (to
> > > a lustre storage system with 20 OSS servers and 60 OSTs),
> > > ibcheckerrors reports about 9000 XmtDiscards within 30 minutes.
> >
> > Yea, those are bad.
> >
> > > Searching for this error I find "This is a symptom of congestion and
> > > may require tweaking either HOQ or switch lifetime values".
> > > Well, I have to admit I neither know what HOQ is, nor do I know how
> > > to tweak it. I also do not have an idea how to set switch lifetime
> > > values. I guess this isn't related to the opensm timeout option, is
> > > it?
> >
> > Yes you should adjust these values.
> >
> > > Hmm, I just found a Cisco pdf describing how to set the lifetime on
> > > these switches, but is this also possible on Flextronics switches?
> >
> > I don't know about the Vendor SMs but in opensm look for the following
> > options in the opensm.opts file (Default path is: /var/cache/opensm):
> >
> > # The code of maximal time a packet can wait at the head of
> > # transmission queue.
> > # The actual time is 4.096usec * 2^<head_of_queue_lifetime>
> > # The value 0x14 disables this mechanism
> > head_of_queue_lifetime 0x12
> >
> > # The maximal time a packet can wait at the head of queue on
> > # switch port connected to a CA or router port
> > leaf_head_of_queue_lifetime 0x0c
>
> Hmm, I first increased head_of_queue_lifetime to 0x13 and
> leaf_head_of_queue_lifetime to 0x20, but this didn't make the error go
> away. So I increased head_of_queue_lifetime to 0x15 and
> leaf_head_of_queue_lifetime to 0x50, but this made the fabric crash
> entirely. On the node of the master opensm I got an endless number
> of messages like these:
>
> Apr 5 01:35:03 pfs1n2 kernel: [705448.344542] NETDEV WATCHDOG: ib0: transmit timed out
> Apr 5 01:35:03 pfs1n2 kernel: [705448.349814] ib0: transmit timeout: latency 411908 msecs
> Apr 5 01:35:03 pfs1n2 kernel: [705448.355364] ib0: queue stopped 1, tx_head 441, tx_tail 377
> Apr 5 01:35:04 pfs1n2 kernel: [705449.343495] NETDEV WATCHDOG: ib0: transmit timed out
>
> The slave opensm also went into D-state and is not killable anymore :(
>
> Seems I have to be very careful with these settings...
>
> Thanks for your help,
> Bernd
> _______________________________________________
> general mailing list
> [email protected]
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
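
For reference, the lifetime formula quoted from opensm.opts in this thread
can be worked through numerically. The following is a minimal Python sketch,
not OpenSM code: the function name hoq_lifetime_seconds and the constant
names are hypothetical. It assumes only what the thread states, namely the
4.096 usec unit and the 0x14 "disable" value from the opensm.opts comments,
and Hal's remark that the field is 5 bits wide with every value above 19
behaving the same. What OpenSM actually does with a value that does not fit
in 5 bits is not modelled here; the sketch simply rejects it.

```python
# Sketch only: names below are illustrative, not part of OpenSM.
HOQ_UNIT_USEC = 4.096   # unit from the opensm.opts comment
HOQ_FIELD_BITS = 5      # field width per Hal's reply (assumption from the thread)
HOQ_DISABLE = 0x14      # "The value 0x14 disables this mechanism"


def hoq_lifetime_seconds(code):
    """Return the head-of-queue lifetime in seconds for a given code.

    Returns None when the code is 0x14 or higher (mechanism disabled).
    Raises ValueError when the code does not fit in a 5-bit field.
    """
    if not 0 <= code < (1 << HOQ_FIELD_BITS):
        raise ValueError("0x%02x does not fit in a %d-bit field"
                         % (code, HOQ_FIELD_BITS))
    if code >= HOQ_DISABLE:
        return None
    # actual time = 4.096 usec * 2^code, converted to seconds
    return HOQ_UNIT_USEC * 1e-6 * (2 ** code)


if __name__ == "__main__":
    # Values mentioned in the thread: the two defaults, then Bernd's attempts.
    for code in (0x12, 0x0c, 0x13, 0x15, 0x20, 0x50):
        try:
            secs = hoq_lifetime_seconds(code)
        except ValueError as err:
            print("0x%02x: out of range (%s)" % (code, err))
            continue
        if secs is None:
            print("0x%02x: mechanism disabled" % code)
        else:
            print("0x%02x: %.6f s" % (code, secs))
```

Under these assumptions the default head_of_queue_lifetime of 0x12 works out
to roughly 1.07 s and the default leaf_head_of_queue_lifetime of 0x0c to
roughly 16.8 ms, while 0x20 and 0x50 are larger than any 5-bit value.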
