On Wed, Mar 18, 2009 at 5:15 PM, jeffrey Lang <[email protected]> wrote:
> I'm using the version currently that is released with Centos 4.X which shows
> as "infiniband-diags-1.3.6-1.el4". found this in the syslog
>
> Mar 18 13:47:43 h2o01 saquery[21966]: Unable to Open IBT Device[/dev/SysIbt]

I'm unfamiliar with the Centos port and have never seen a message like that so I haven't a clue.

-- Hal

> Now i just need to figure out why the device entry doesn't exist.
>
> With the firmware update the errors below have disappeared and the IPOIB is
> now working.
>
> jeff
>
> Hal Rosenstock wrote:
>
> On Wed, Mar 18, 2009 at 10:37 AM, jeffrey Lang <[email protected]> wrote:
>
> Here's the output for ibchecknet:
> [r...@h2o01 ~]# ibchecknet
> perfquery: iberror: failed: smp query nodeinfo: Node type not CA
>
> What diags version is being used ?
>
> Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port all: FAILED
> #warn: counter SymbolErrors = 43259 (threshold 10) lid 1 port 17
> Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port 17: FAILED
> #warn: counter LinkRecovers = 207 (threshold 10) lid 1 port 2
> #warn: counter RcvErrors = 112 (threshold 10) lid 1 port 2
> Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port 2: FAILED
> #warn: counter LinkDowned = 10 (threshold 10) lid 1 port 1
> #warn: counter RcvErrors = 95 (threshold 10) lid 1 port 1
> Error check on lid 1 (SilverStorm 9024 GUID=0x00066a00d8000593) port 1: FAILED
>
> Are the counts for these ports (1,2,17) changing ? You can look with
> perfquery 1 <port #>.
> This is a separate issue from the lack of an IPoIB broadcast group
> assuming these numbers are incrementing.
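
One way to answer the "are the counts changing?" question is to capture the `#warn` lines from two ibchecknet runs and diff them. The following is only a sketch: the first snapshot is copied from the output quoted above, while the second snapshot and the helper names `parse_warnings` and `incrementing` are hypothetical and exist purely for illustration.

```python
import re

# Matches warning lines of the form quoted in this thread, e.g.
# "#warn: counter SymbolErrors = 43259 (threshold 10) lid 1 port 17"
WARN_RE = re.compile(
    r"#warn: counter (\w+)\s+=\s+(\d+)\s+\(threshold (\d+)\)\s+lid (\d+)\s+port (\d+)"
)


def parse_warnings(text):
    """Return {(lid, port, counter_name): value} for every #warn line in text."""
    counters = {}
    for match in WARN_RE.finditer(text):
        name, value, _threshold, lid, port = match.groups()
        counters[(int(lid), int(port), name)] = int(value)
    return counters


def incrementing(before, after):
    """Return {key: delta} for counters that grew between two snapshots."""
    return {
        key: after[key] - value
        for key, value in before.items()
        if key in after and after[key] > value
    }


# First snapshot: values taken from the ibchecknet output in this thread.
first_run = """
#warn: counter SymbolErrors = 43259 (threshold 10) lid 1 port 17
#warn: counter LinkRecovers = 207 (threshold 10) lid 1 port 2
#warn: counter RcvErrors = 112 (threshold 10) lid 1 port 2
#warn: counter LinkDowned = 10 (threshold 10) lid 1 port 1
#warn: counter RcvErrors = 95 (threshold 10) lid 1 port 1
"""

# Second snapshot: HYPOTHETICAL values, made up to exercise the comparison.
second_run = """
#warn: counter SymbolErrors = 43301 (threshold 10) lid 1 port 17
#warn: counter LinkRecovers = 207 (threshold 10) lid 1 port 2
#warn: counter RcvErrors = 112 (threshold 10) lid 1 port 2
"""

deltas = incrementing(parse_warnings(first_run), parse_warnings(second_run))
for (lid, port, name), delta in sorted(deltas.items()):
    print(f"lid {lid} port {port}: {name} grew by {delta}")
```

A counter that stays flat between runs is more likely left over from the recabling; one that keeps growing points at a cable or port that still needs attention.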
> -- Hal
>
> # Checking Ca: nodeguid 0x00066a0098007e99
> # Checking Ca: nodeguid 0x00066a0098007e9b
> # Checking Ca: nodeguid 0x00066a0098007e97
> # Checking Ca: nodeguid 0x00066a0098007e8c
> # Checking Ca: nodeguid 0x00066a0098007e94
> # Checking Ca: nodeguid 0x00066a0098007e93
> # Checking Ca: nodeguid 0x00066a0098007e8e
> # Checking Ca: nodeguid 0x00066a0098007e90
> # Checking Ca: nodeguid 0x00066a0098007e98
> # Checking Ca: nodeguid 0x00066a0098007e95
> # Checking Ca: nodeguid 0x00066a0098007e8f
> # Checking Ca: nodeguid 0x00066a0098007e92
> # Checking Ca: nodeguid 0x00066a0098007e8d
> # Checking Ca: nodeguid 0x00066a0098007e91
> # Checking Ca: nodeguid 0x00066a0098007e96
> # Checking Ca: nodeguid 0x00066a0098007e9c
> ## Summary: 17 nodes checked, 0 bad nodes found
> ## 32 ports checked, 0 bad ports found
> ## 3 ports have errors beyond threshold
>
> I see these messages in the switch log now:
>
> E|2009/03/18 07:34:28.635S: Thread "esm_sar" (0x83394a90)
> ESM: Embedded SM Error: sa_McMemberRecord_Set: Component mask of
> 0x0000000000010083 does not have bits required to create a group
> (0x00000000000130C6) for new MGID of 0xFF12401BFFFF0000:00000000FFFFFFFF for
> request from h2o12 HCA-1, Port 0x00066A00A0007E8E, LID 0x000C, returning
> status 0x0600 : 0
>
> I would have to assume that this is my problem, but how to fix?
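
The two component masks in that switch log entry can be compared bit by bit to see what the SM found missing. This is a rough sketch, not an authoritative decoder: the two mask values come from the log line above, but the field-name table is an assumption based on the commonly documented MCMemberRecord component-mask order and should be checked against the InfiniBand specification before relying on it.

```python
# ASSUMED bit layout of the MCMemberRecord component mask (bit 0 first).
# The names are illustrative; verify them against the IBA specification.
MCMEMBER_BITS = [
    "MGID", "PortGID", "Q_Key", "MLID", "MTUSelector", "MTU", "TClass",
    "P_Key", "RateSelector", "Rate", "PacketLifeTimeSelector",
    "PacketLifeTime", "SL", "FlowLabel", "HopLimit", "Scope",
    "JoinState", "ProxyJoin",
]


def bit_names(mask):
    """Return the assumed field names for every bit set in mask."""
    return [name for i, name in enumerate(MCMEMBER_BITS) if mask & (1 << i)]


# Both values are taken verbatim from the switch log message.
supplied = 0x0000000000010083  # mask sent in the join request
required = 0x00000000000130C6  # mask the SM wants before creating a group

# Bits the SM requires for group creation that the request did not carry.
missing = required & ~supplied

print(f"supplied: {supplied:#x} -> {bit_names(supplied)}")
print(f"required: {required:#x} -> {bit_names(required)}")
print(f"missing:  {missing:#x} -> {bit_names(missing)}")
```

If the assumed layout is right, the request carries only the fields needed to join an existing group, which matches the log's complaint that it lacks the bits required to create one.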
> jeff
>
> Hal Rosenstock wrote:
>
> On Tue, Mar 17, 2009 at 6:04 PM, jeffrey Lang <[email protected]> wrote:
>
> Here's the output of smpquery portinfo -D 0 as requested below:
>
> [r...@h2o01 ~]# smpquery portinfo -D 0
> # Port info: DR path 0 port 0
> Mkey:............................0x0000000000000000
> GidPrefix:.......................0xfe80000000000000
> Lid:.............................0x0003
> SMLid:...........................0x0001
> CapMask:.........................0x2510a68
>         IsTrapSupported
>         IsAutomaticMigrationSupported
>         IsSLMappingSupported
>         IsLedInfoSupported
>         IsSystemImageGUIDsupported
>         IsCommunicatonManagementSupported
>         IsVendorClassSupported
>         IsCapabilityMaskNoticeSupported
>         IsClientRegistrationSupported
> DiagCode:........................0x0000
> MkeyLeasePeriod:.................0
> LocalPort:.......................1
> LinkWidthEnabled:................1X or 4X
> LinkWidthSupported:..............1X or 4X
> LinkWidthActive:.................4X
> LinkSpeedSupported:..............2.5 Gbps
> LinkState:.......................Active
> PhysLinkState:...................LinkUp
> LinkDownDefState:................Polling
> ProtectBits:.....................0
> LMC:.............................0
> LinkSpeedActive:.................2.5 Gbps
> LinkSpeedEnabled:................2.5 Gbps
> NeighborMTU:.....................2048
> SMSL:............................0
> VLCap:...........................VL0-3
> InitType:........................0x00
> VLHighLimit:.....................0
> VLArbHighCap:....................8
> VLArbLowCap:.....................8
> InitReply:.......................0x00
> MtuCap:..........................2048
> VLStallCount:....................7
> HoqLife:.........................0
> OperVLs:.........................VL0-3
> PartEnforceInb:..................0
> PartEnforceOutb:.................0
> FilterRawInb:....................0
> FilterRawOutb:...................0
> MkeyViolations:..................0
> PkeyViolations:..................0
> QkeyViolations:..................0
> GuidCap:.........................32
> ClientReregister:................0
> SubnetTimeout:...................17
> RespTimeVal:.....................16
> LocalPhysErr:....................15
> OverrunErr:......................15
> MaxCreditHint:...................0
> RoundTrip:.......................0
>
> Looks fine.
>
> I did some checking, and it's not just this node having problems; all nodes
> seem to be having this same problem.
>
> Would you also run ibchecknet ?
> What error messages are on the SM side ?
>
> -- Hal
>
> jeff
>
> Hal Rosenstock wrote:
>
> 2009/3/17 jeffrey Lang <[email protected]>:
>
> First let me say, I hope this is the right list for this email, if not
> please forgive me.
>
> I have a small 16 node compute cluster. The university where I work
> recently opened a new Datacenter. My cluster was moved from the old
> Datacenter. Before the move the infiniband was working properly, after
> the move the ipoib has stopped working.
>
> The cluster runs Centos 4 with all the latest updates and the Centos
> distributed OFED code. My plan was to update the OFED code once things had
> restabilized.
>
> For the move, I shut down the cluster, removed the infiniband cables and the
> cluster was moved. I then reinstalled the infiniband cables (not in the
> same order as before the move) and brought everything back up.
>
> When I brought the cluster back up the ipoib would not work. The only
> message in the log file is "Mar 15 04:04:32 h2o01 kernel: ib0: multicast
> join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22".
>
> I think that there may be a rate issue in terms of this node relative
> to the IPoIB broadcast group which by default is 10 Gbps (4x SDR).
> What does this node's portinfo show (smpquery portinfo -D 0) in terms
> of link width and speed ?
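
The rate check being asked for here is just active width times active per-lane speed, compared with the 10 Gbps (4x SDR) default of the broadcast group. A minimal sketch of that arithmetic follows; the portinfo excerpt is copied from the output quoted in this thread, while the helper names `field` and `active_rate_gbps` and the width table are assumptions made for illustration.

```python
import re

# Lanes per link width string (assumed set of width names).
WIDTH_LANES = {"1X": 1, "4X": 4, "8X": 8, "12X": 12}

# Default IPoIB broadcast group rate mentioned in the thread: 4x SDR.
GROUP_RATE_GBPS = 10.0


def field(portinfo, name):
    """Return the value of a 'Name:......value' line from smpquery output."""
    match = re.search(re.escape(name) + r":\.+\s*(\S.*)", portinfo)
    if match is None:
        raise KeyError(name)
    return match.group(1).strip()


def active_rate_gbps(portinfo):
    """Active link rate = number of active lanes x per-lane speed in Gbps."""
    lanes = WIDTH_LANES[field(portinfo, "LinkWidthActive")]
    per_lane = float(field(portinfo, "LinkSpeedActive").split()[0])
    return lanes * per_lane


# Excerpt of the smpquery portinfo output quoted in this thread.
sample = """
LinkWidthEnabled:................1X or 4X
LinkWidthSupported:..............1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps
LinkSpeedActive:.................2.5 Gbps
LinkSpeedEnabled:................2.5 Gbps
"""

rate = active_rate_gbps(sample)
verdict = "meets" if rate >= GROUP_RATE_GBPS else "is below"
print(f"port rate {rate} Gbps {verdict} the group rate {GROUP_RATE_GBPS} Gbps")
```

A port that had trained down to 1X would come out at 2.5 Gbps and fall below the group rate, which is the failure mode this question is meant to rule out.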
> -- Hal
>
> The master node can see all the systems:
>
> [r...@h2o01 log]# ibnodes
> Ca : 0x00066a0098007e99 ports 1 "h2o17 HCA-1"
> Ca : 0x00066a0098007e9b ports 1 "h2o18 HCA-1"
> Ca : 0x00066a0098007e97 ports 1 "h2o16 HCA-1"
> Ca : 0x00066a0098007e8c ports 1 "h2o15 HCA-1"
> Ca : 0x00066a0098007e94 ports 1 "h2o14 HCA-1"
> Ca : 0x00066a0098007e93 ports 1 "h2o13 HCA-1"
> Ca : 0x00066a0098007e8e ports 1 "h2o12 HCA-1"
> Ca : 0x00066a0098007e90 ports 1 "h2o11 HCA-1"
> Ca : 0x00066a0098007e98 ports 1 "h2o10 HCA-1"
> Ca : 0x00066a0098007e95 ports 1 "h2o09 HCA-1"
> Ca : 0x00066a0098007e8f ports 1 "h2o08 HCA-1"
> Ca : 0x00066a0098007e92 ports 1 "h2o07 HCA-1"
> Ca : 0x00066a0098007e8d ports 1 "h2o06 HCA-1"
> Ca : 0x00066a0098007e91 ports 1 "h2o05 HCA-1"
> Ca : 0x00066a0098007e96 ports 1 "h2ocfs HCA-1"
> Ca : 0x00066a0098007e9c ports 1 "h2o01 HCA-1"
> Switch : 0x00066a00d8000593 ports 24 "SilverStorm 9024 GUID=0x00066a00d8000593" enhanced port 0 lid 1 lmc 0
>
> I've reset the SM on the switch, but nothing seems to work.
>
> Any ideas of where to look for what's causing the problem?
>
> jeff
>
> _______________________________________________
> general mailing list
> [email protected]
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
