What Ami is asking you to do is to try to reproduce the problem with -RC3, or with -RC4 when it's out. If the problem goes away, we'll know it was one of the bugs that got fixed since -RC1; if not, it'll be easier to debug on a recent RC.
Quoting SEGERS Koen <[EMAIL PROTECTED]>:
Subject: RE: GPFS node loses IB-connection

That is very difficult. This system is supposed to go into production within a few weeks. Changing the OFED drivers requires rebuilding a lot of other programs. If it isn't really necessary, I prefer not to do this...

Koen

-----Original Message-----
From: Ami Perlmutter [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 29, 2007 17:05
To: SEGERS Koen
CC: [EMAIL PROTECTED]; [email protected]
Subject: RE: [ofa-general] GPFS node loses IB-connection

any chance of moving to rc3 (or wait till rc4)?

On Tue, 2007-05-29 at 16:56 +0200, SEGERS Koen wrote:
> We don't really see data getting lost. We don't get an error in the log
> files of gpfs. We only got a system that was not able to read its
> filesystem anymore. It was exactly at the time this FIXME error
> occurred.
>
> Therefore I think there must be some kind of correlation. But I don't
> really know what ... :(
>
> Koen
>
> -----Original Message-----
> From: Ami Perlmutter [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, May 29, 2007 16:40
> To: SEGERS Koen
> CC: [EMAIL PROTECTED]; [email protected]
> Subject: RE: [ofa-general] GPFS node loses IB-connection
>
> can you describe the scenario in which you see data lost?
> does the "SDP: FIXME MID 11" message correlate with the data loss?
>
> On Tue, 2007-05-29 at 15:29 +0200, SEGERS Koen wrote:
> > I just remembered that, with SDP, these values aren't related anymore.
> > SDP doesn't give this kind of information to the OS.
> >
> > Koen
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On behalf of SEGERS Koen
> > Sent: Tuesday, May 29, 2007 14:29
> > To: [EMAIL PROTECTED]
> > CC: [EMAIL PROTECTED]; [email protected]
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > One of the machines has 2 dropped packets:
> >
> > gpfswhbe2n1:~ # ifconfig ib0
> > ib0       Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
> >           inet addr:192.168.2.1  Bcast:192.168.4.255  Mask:255.255.255.0
> >           inet6 addr: fe80::205:ad00:5:87c9/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
> >           RX packets:17311 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:19946 errors:0 dropped:2 overruns:0 carrier:0
> >           collisions:0 txqueuelen:128
> >           RX bytes:148363444 (141.4 Mb)  TX bytes:6715076 (6.4 Mb)
> >
> > Can this be related?
> >
> > Does anyone know how this is possible with sdp? I thought SDP was a RC.
> > I'm also curious how gpfs reacts to this. Do you know where I can find
> > the timestamp of these dropped packets?
> >
> > Koen
> >
> > -----Original Message-----
> > From: Ami Perlmutter [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, May 29, 2007 14:03
> > To: SEGERS Koen
> > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > [EMAIL PROTECTED]; [email protected]
> > Subject: RE: [ofa-general] GPFS node loses IB-connection
> >
> > if this is an actual resize request then there is no problem when it is
> > dropped.
> > since you are running rc1, no resize requests should be sent so this
> > means there is a problem since data could be dropped. do you notice lost
> > data?
> >
> > On Tue, 2007-05-29 at 13:37 +0200, SEGERS Koen wrote:
> > > We are running ofed-1.2.RC1 on all machines. Hence it is impossible that
> > > this message was added only a few days ago.
> > >
> > > How can you be so sure that this doesn't pose any problems?
> > >
> > > Koen
> > >
> > > -----Original Message-----
> > > From: Ami Perlmutter [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, May 29, 2007 13:35
> > > To: SEGERS Koen
> > > CC: Scott Weitzenkamp (sweitzen); Hal Rosenstock;
> > > [EMAIL PROTECTED]; [email protected]
> > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > >
> > > this means you are getting a message your SDP does not recognize.
> > > message 11 is resize request which was added to sdp a few days ago.
> > > can it be that you are running 2 different versions of OFED?
> > > anyway, this doesn't pose any problem so you can ignore it.
> > >
> > > On Tue, 2007-05-29 at 12:03 +0200, SEGERS Koen wrote:
> > > > Hi,
> > > >
> > > > Saturday we did a different stresstest.
> > > > This is what we see in the /var/log/messages:
> > > >
> > > > May 25 10:41:19 gpfswhbe2s1 kernel: SDP: FIXME MID 11
> > > >
> > > > There were errors from that time on. Can someone explain to me what this
> > > > message does?
> > > >
> > > > Koen
> > > >
> > > > -----Original Message-----
> > > > From: Scott Weitzenkamp (sweitzen) [mailto:[EMAIL PROTECTED]
> > > > Sent: Wednesday, May 23, 2007 17:41
> > > > To: SEGERS Koen; Hal Rosenstock
> > > > CC: Clive Hall (clivhall); [EMAIL PROTECTED];
> > > > [email protected]
> > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > >
> > > > Try 20 seconds, I'm curious if you are barely crossing the 10-second
> > > > threshold.
> > > >
> > > > Scott
> > > >
> > > > > -----Original Message-----
> > > > > From: SEGERS Koen [mailto:[EMAIL PROTECTED]
> > > > > Sent: Wednesday, May 23, 2007 8:39 AM
> > > > > To: Scott Weitzenkamp (sweitzen); Hal Rosenstock
> > > > > Cc: Clive Hall (clivhall);
> > > > > [EMAIL PROTECTED]; [email protected]
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > What value would you recommend then?
> > > > >
> > > > > Koen
> > > > >
> > > > > -----Original Message-----
> > > > > From: Scott Weitzenkamp (sweitzen) [mailto:[EMAIL PROTECTED]
> > > > > Sent: Wednesday, May 23, 2007 17:38
> > > > > To: SEGERS Koen; Hal Rosenstock
> > > > > CC: Clive Hall (clivhall); [EMAIL PROTECTED];
> > > > > [email protected]
> > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > >
> > > > > The boot time of the host doesn't matter for this timeout. While the
> > > > > host is booting, the IB link is down anyway.
> > > > >
> > > > > Scott
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: SEGERS Koen [mailto:[EMAIL PROTECTED]
> > > > > > Sent: Wednesday, May 23, 2007 8:20 AM
> > > > > > To: Hal Rosenstock; Scott Weitzenkamp (sweitzen)
> > > > > > Cc: Clive Hall (clivhall);
> > > > > > [EMAIL PROTECTED]; [email protected]
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >
> > > > > > After a whole day of stresstesting with the MAD renicing turned on, we
> > > > > > got the error once. So I think I should raise the timeout on the switch
> > > > > > also.
> > > > > >
> > > > > > It takes about 2 minutes to boot the system. Do you agree that this is a
> > > > > > good value for the timeout?
> > > > > >
> > > > > > Scott,
> > > > > > Can you explain to me the problem of the memlock?
> > > > > >
> > > > > > I saw that the SLES10 bug is only an issue in MVAPICH. Since we didn't
> > > > > > install this, the bug is not related to us. This is correct, isn't it?
> > > > > >
> > > > > > Greetz
> > > > > >
> > > > > > Koen
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> > > > > > Sent: Wednesday, May 23, 2007 16:12
> > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > CC: SEGERS Koen; Clive Hall (clivhall);
> > > > > > [EMAIL PROTECTED]; [email protected]
> > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > >
> > > > > > On Wed, 2007-05-23 at 09:51, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > No C code changes, just a few config file changes (RENICE_IB_MAD=yes in
> > > > > > > openib.conf,
> > > > > >
> > > > > > Does the host really not respond to MAD requests for over 10 seconds in
> > > > > > some cases ?
> > > > > >
> > > > > > -- Hal
> > > > > >
> > > > > > > memlock in /etc/security/limits.conf, fix /etc/hosts on
> > > > > > > SLES10 for bug 267, etc.).
> > > > > > >
> > > > > > > Scott Weitzenkamp
> > > > > > > SQA and Release Manager
> > > > > > > Server Virtualization Business Unit
> > > > > > > Cisco Systems
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: SEGERS Koen [mailto:[EMAIL PROTECTED]
> > > > > > > > Sent: Wednesday, May 23, 2007 6:48 AM
> > > > > > > > To: Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > > Cc: Shirley Ma; Ami Perlmutter;
> > > > > > > > [email protected]; [EMAIL PROTECTED]
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > >
> > > > > > > > So far, all tests seem to work.
> > > > > > > >
> > > > > > > > Thanks for the help!
> > > > > > > >
> > > > > > > > Scott,
> > > > > > > > Are there more bugfixes that Cisco does in its rpms?
> > > > > > > >
> > > > > > > > Greetz
> > > > > > > >
> > > > > > > > Koen
> > > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Scott Weitzenkamp (sweitzen) [mailto:[EMAIL PROTECTED]
> > > > > > > > Sent: Wednesday, May 23, 2007 0:39
> > > > > > > > To: SEGERS Koen; Scott Weitzenkamp (sweitzen); Clive Hall (clivhall)
> > > > > > > > CC: Shirley Ma; Ami Perlmutter; [email protected];
> > > > > > > > [EMAIL PROTECTED]
> > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > >
> > > > > > > > It's not so much pinging every 10 seconds as expecting a response within
> > > > > > > > 10 seconds (Clive, correct me if I'm wrong).
> > > > > > > >
> > > > > > > > You only need to do 1) or 2), not both. Cisco configures 1) in the OFED
> > > > > > > > binary RPMs we release at
> > > > > > > > http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux. I prefer to have
> > > > > > > > the host be more responsive.
> > > > > > > >
> > > > > > > > Scott Weitzenkamp
> > > > > > > > SQA and Release Manager
> > > > > > > > Server Virtualization Business Unit
> > > > > > > > Cisco Systems
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Koen Segers [mailto:[EMAIL PROTECTED]
> > > > > > > > > Sent: Tuesday, May 22, 2007 3:35 PM
> > > > > > > > > To: Scott Weitzenkamp (sweitzen)
> > > > > > > > > Cc: Shirley Ma; Ami Perlmutter;
> > > > > > > > > [email protected]; [EMAIL PROTECTED]
> > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > >
> > > > > > > > > If I understand it right, the switch is actually polling (=pinging) the
> > > > > > > > > interfaces every 10s.
> > > > > > > > > This means that when the interface is handling
> > > > > > > > > other traffic, the poll can fail and the port could be considered out of
> > > > > > > > > service. My question is then: "How can the timeout be reached while
> > > > > > > > > packets are being sent/received to/from the interface?"
> > > > > > > > >
> > > > > > > > > Anyway, what timeout-value would you recommend for us? And why?
> > > > > > > > >
> > > > > > > > > To recapitulate: these are the actions I'll take tomorrow
> > > > > > > > > 1) change the MAD niceness of the servers
> > > > > > > > > 2) change the timeout on the switches
> > > > > > > > >
> > > > > > > > > Are these changes sufficient for the HCA's to keep their ports in
> > > > > > > > > PORT_ACTIVE state?
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Koen
> > > > > > > > >
> > > > > > > > > On Tue, 2007-05-22 at 12:59 -0700, Scott Weitzenkamp (sweitzen) wrote:
> > > > > > > > > > Yes, you can tune it. Here's an example via the switch CLI:
> > > > > > > > > >
> > > > > > > > > > SFS-7000D(config)# ib sm subnet-prefix fe:80:00:00:00:00:00:00
> > > > > > > > > > node-timeout <value>
> > > > > > > > > >
> > > > > > > > > > The default is 10 seconds, it can be configured up to 2000 seconds.
> > > > > > > > > > If a HCA is completely unresponsive for longer than the node-timeout
> > > > > > > > > > value, then we consider that HCA out of service.
> > > > > > > > > >
> > > > > > > > > > Scott Weitzenkamp
> > > > > > > > > > SQA and Release Manager
> > > > > > > > > > Server Virtualization Business Unit
> > > > > > > > > > Cisco Systems
> > > > > > > > > >
> > > > > > > > > > ______________________________________________________________
> > > > > > > > > > From: Shirley Ma [mailto:[EMAIL PROTECTED]
> > > > > > > > > > Sent: Tuesday, May 22, 2007 11:30 AM
> > > > > > > > > > To: [EMAIL PROTECTED]
> > > > > > > > > > Cc: Ami Perlmutter; [email protected];
> > > > > > > > > > [EMAIL PROTECTED]; Scott Weitzenkamp (sweitzen)
> > > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > >
> > > > > > > > > > Koen,
> > > > > > > > > >
> > > > > > > > > > So it is most likely you hit the same bug as 229 (Scott
> > > > > > > > > > pointed out earlier). The same workaround might work for you
> > > > > > > > > > by renicing ib_mad as Scott suggested.
> > > > > > > > > >
> > > > > > > > > > I think this should be a SM query timeout tunable value in
> > > > > > > > > > Cisco SM. Am I right, Scott?
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > > Shirley Ma
> > > > > > > > > >
> > > > > > > > > > From: Koen Segers <[EMAIL PROTECTED]>
> > > > > > > > > > Sent: 05/22/07 11:14 AM
> > > > > > > > > > Please respond to [EMAIL PROTECTED]
> > > > > > > > > > To: Shirley Ma/Beaverton/[EMAIL PROTECTED]
> > > > > > > > > > cc: Ami Perlmutter <[EMAIL PROTECTED]>, [email protected], [EMAIL PROTECTED]
> > > > > > > > > > Subject: RE: [ofa-general] GPFS node loses IB-connection
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > It is the Cisco SM.
> > > > > > > > > >
> > > > > > > > > > SFS-7000P> show version
> > > > > > > > > >
> > > > > > > > > > ================================================================================
> > > > > > > > > > System Version Information
> > > > > > > > > > ================================================================================
> > > > > > > > > > system-version   : SFS-7000P TopspinOS 2.9.0 releng #147 10/25/2006 02:01:32
> > > > > > > > > > contact          : [EMAIL PROTECTED]
> > > > > > > > > > name             : SFS-7000P
> > > > > > > > > > location         : 170 West Tasman Drive, San Jose, CA 95134
> > > > > > > > > > up-time          : 11(d):7(h):49(m):3(s)
> > > > > > > > > > last-change      : none
> > > > > > > > > > last-config-save : none
> > > > > > > > > > action           : none
> > > > > > > > > > result           : none
> > > > > > > > > > oper-mode        : normal
> > > > > > > > > >
> > > > > > > > > > There is also a command that gives the SM version, but I can't find it
> > > > > > > > > > right now.
> > > > > > > > > >
> > > > > > > > > > On Tue, 2007-05-22 at 09:45 -0700, Shirley Ma wrote:
> > > > > > > > > > > Hello Koen,
> > > > > > > > > > >
> > > > > > > > > > > From the switch log, it looks like a SM issue to me. The node was kicked
> > > > > > > > > > > out of the membership. Which SM are you using in your fabric?
> > > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > > Shirley Ma
> > > > > > > > > >
> > > > > > > > > > *** Disclaimer ***
> > > > > > > > > >
> > > > > > > > > > Vlaamse Radio- en Televisieomroep
> > > > > > > > > > Auguste Reyerslaan 52, 1043 Brussel
> > > > > > > > > >
> > > > > > > > > > nv van publiek recht
> > > > > > > > > > BTW BE 0244.142.664
> > > > > > > > > > RPR Brussel
> > > > > > > > > > http://www.vrt.be/disclaimer

--
MST
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
