On Fri, 3 Sep 2010 14:04:37 -0700 Chuck Hartley <[email protected]> wrote:
> I checked another working fabric here and also see the same warnings,
> so it looks like the warnings are not really a problem.

Yes I think you should consider those warnings not errors.

>
> Well, I assume that it is just IPoIB that isn't working.  Since ibping
> works, I believe that says the IB part is ok.  Of course, I can't run
> any of the perftools since they all need IPoIB to resolve the host IP.
>
> Do you have any suggestions of what to check to diagnose the IPoIB
> problem?

Can you log into the nodes or do you have console output?  Is ib0 up?

Ira

> Specifically, can you think of any interaction with the
> "normal" networking stuff in the kernel that might be misconfigured?
> The reason I mention that is because I rebuilt/installed OFED (no
> errors/warnings) and it is in its default configuration, which is
> running well on other similar fabrics here.  Therefore I assume the
> problem must be with the non-OFED stuff.  Previously, whenever this
> kind of problem cropped up it has always been because opensm was not
> running.  I did check that iptables was off, so it isn't a firewall
> issue.
>
> - Chuck
>
>
> On Thu, Sep 2, 2010 at 4:16 PM, Ira Weiny <[email protected]> wrote:
> > On Thu, 2 Sep 2010 11:11:13 -0700
> > Chuck Hartley <[email protected]> wrote:
> >
> >> Sure, here is the output:
> >> Note this is with the switch we swapped in, so the port numbers don't
> >> match the ibchecknet output in the original message.
> >>
> >> # ibstat
> >> CA 'mlx4_0'
> >>     CA type: MT26428
> >>     Number of ports: 2
> >>     Firmware version: 2.6.0
> >>     Hardware version: a0
> >>     Node GUID: 0x0002c90300032de0
> >>     System image GUID: 0x0002c90300032de3
> >>     Port 1:
> >>         State: Active
> >>         Physical state: LinkUp
> >>         Rate: 40
> >>         Base lid: 6
> >>         LMC: 0
> >>         SM lid: 6
> >
> > Well the SM lid is set here.  Is it set on the other nodes?
> >
> > I don't run ibchecknet usually but I am getting the same errors here on a
> > working fabric...
> >
> > ibwarn: [13629] dump_perfcounters: PortXmitWait not indicated so ignore this counter
> > #warn: Lid is not configured lid 37 port 2
> > #warn: SM Lid is not configured
> > Port check lid 37 port 2: FAILED
> >
> > Looking at this output I don't think this is an error.
> >
> > 13:17:14 > smpquery nodeinfo 37
> > # Node info: Lid 37
> > BaseVers:........................1
> > ClassVers:.......................1
> > NodeType:........................Switch
> > NumPorts:........................24
> > ...
> >
> > On switch external Ports the Lid and SMLid are not used.
> >
> > Hal, would you concur?
> >
> > Chuck,
> > Is it just that IPoIB is not working for you?
> >
> > Ira
> >
> >
> >>         Capability mask: 0x0251086a
> >>         Port GUID: 0x0002c90300032de1
> >>     Port 2:
> >>         State: Down
> >>         Physical state: Polling
> >>         Rate: 10
> >>         Base lid: 0
> >>         LMC: 0
> >>         SM lid: 0
> >>         Capability mask: 0x02510868
> >>         Port GUID: 0x0002c90300032de2
> >> CA 'mthca0'
> >>     CA type: MT25204
> >>     Number of ports: 1
> >>     Firmware version: 1.2.0
> >>     Hardware version: a0
> >>     Node GUID: 0x003048c64c0c0000
> >>     System image GUID: 0x003048c64c0c0003
> >>     Port 1:
> >>         State: Down
> >>         Physical state: Polling
> >>         Rate: 10
> >>         Base lid: 0
> >>         LMC: 0
> >>         SM lid: 0
> >>         Capability mask: 0x02510a68
> >>         Port GUID: 0x003048c64c0c0001
> >>
> >> # iblinkinfo
> >> Switch 0x0002c9020041a7a0 Infiniscale-IV Mellanox Technologies:
> >>   1    1[ ] ==( 4X 10.0 Gbps Active/  LinkUp)==>   5    1[ ] " HCA-1" ( )
> >>   1    2[ ] ==( 4X 10.0 Gbps Active/  LinkUp)==>   6    1[ ] "linux70 HCA-1" ( )
> >>   1    3[ ] ==( 4X 10.0 Gbps Active/  LinkUp)==>   7    1[ ] "linux71 HCA-1" ( )
> >>   1    4[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1    5[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1    6[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1    7[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1    8[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1    9[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   10[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   11[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   12[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   13[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   14[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   15[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   16[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   17[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   18[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   19[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   20[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   21[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   22[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   23[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   24[ ] ==( 4X  5.0 Gbps Active/  LinkUp)==>   9    1[ ] " HCA-1" ( )
> >>   1   25[ ] ==( 4X  5.0 Gbps Active/  LinkUp)==>   8    1[ ] " HCA-1" ( )
> >>   1   26[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   27[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   28[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   29[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   30[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   31[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   32[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   33[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   34[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   35[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>   1   36[ ] ==( 4X  2.5 Gbps   Down/ Polling)==>         [ ] "" ( )
> >>
> >> On Thu, Sep 2, 2010 at 12:03 PM, Ira Weiny <[email protected]> wrote:
> >> > On Thu, 2 Sep 2010 06:56:50 -0700
> >> > Chuck Hartley <[email protected]> wrote:
> >> >
> >> >> We swapped in a different switch and see the same errors.
> >> >> The opensm logfile does not show any errors:
> >> >
> >> > Could you run "ibstat" on the node with OpenSM running?
> >> >
> >> > And "iblinkinfo" on the same node?
> >> >
> >> > Send that output.
> >> >
> >> > Ira
> >> >
> >> >>
> >> >> -------------------------------------------------
> >> >> OpenSM 3.3.5
> >> >> Command Line Arguments:
> >> >> Daemon mode
> >> >> Log File: /var/log/opensm.log
> >> >> -------------------------------------------------
> >> >> OpenSM 3.3.5
> >> >>
> >> >> Sep 02 05:56:29 933684 [B53B8700] 0x80 -> OpenSM 3.3.5
> >> >> Entering DISCOVERING state
> >> >>
> >> >> Sep 02 05:56:29 934931 [B53B8700] 0x02 -> osm_vendor_init: 1000 pending umads specified
> >> >> Sep 02 05:56:29 935079 [B53B8700] 0x80 -> Entering DISCOVERING state
> >> >> Using default GUID 0x2c90300032de1
> >> >> Entering MASTER state
> >> >>
> >> >> Sep 02 05:56:29 953763 [B53B8700] 0x02 -> osm_vendor_bind: Binding to port 0x2c90300032de1
> >> >> Sep 02 05:56:29 990146 [B53B8700] 0x02 -> osm_vendor_bind: Binding to port 0x2c90300032de1
> >> >> Sep 02 05:56:29 990240 [B53B8700] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0002c90300032de1
> >> >> Sep 02 05:56:30 009040 [AF1DB710] 0x80 -> Entering MASTER state
> >> >> SUBNET UP
> >> >>
> >> >> Sep 02 05:56:30 009885 [AF1DB710] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
> >> >> Sep 02 05:56:30 014593 [AF1DB710] 0x80 -> SUBNET UP
> >> >>
> >> >>
> >> >> On Thu, Sep 2, 2010 at 8:56 AM, Hal Rosenstock <[email protected]> wrote:
> >> >> > On Thu, Sep 2, 2010 at 8:34 AM, Chuck Hartley <[email protected]> wrote:
> >> >> >> Hello,
> >> >> >>
> >> >> >> We installed 1.5.1 and are having problems getting the IB fabric
> >> >> >> working.  ibv_devinfo shows the HCAs ports are ok and ibdiagnet
> >> >> >> reports no errors.  However, ibchecknet shows that the switch
> >> >> >> ports are not being configured.
> >> >> >> We have never seen this before and are at a loss as to where
> >> >> >> the problem might be - would someone please point us in the
> >> >> >> right direction to look?  Could it be a problem with the switch
> >> >> >> itself?  Output from ibchecknet below.
> >> >> >>
> >> >> >>
> >> >> >> # ibchecknet
> >> >> >> Error check on lid 3 (Infiniscale-IV Mellanox Technologies) port all: FAILED
> >> >> >> ibwarn: [26732] dump_perfcounters: PortXmitWait not indicated so ignore this counter
> >> >> >> #warn: Lid is not configured lid 3 port 7
> >> >> >> #warn: SM Lid is not configured
> >> >> >
> >> >> > Is there an SM running on your subnet ? If so, I think that the lack
> >> >> > of an SM could account for all of the issues mentioned here.
> >> >> >
> >> >> > -- Hal
> >> >> >

--
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
[email protected]
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
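
For anyone following the suggestion in this thread to compare the SM lid across nodes, the check can be scripted. What follows is only a minimal sketch: it assumes the "Key: value" layout of the ibstat output quoted above, it works on saved text and does not run ibstat itself, and the names parse_ibstat and active_ports_without_sm are hypothetical helpers made up for illustration, not part of OFED or infiniband-diags. The sample text is trimmed from the ibstat output in the thread.

```python
import re

# Trimmed from the ibstat output quoted in this thread.
SAMPLE = """\
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 2
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 40
        Base lid: 6
        LMC: 0
        SM lid: 6
        Capability mask: 0x0251086a
        Port GUID: 0x0002c90300032de1
    Port 2:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
"""


def parse_ibstat(text):
    """Return one dict per port found in ibstat-style text (hypothetical helper)."""
    ports = []
    ca = None
    port = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"CA '([^']+)'", line)
        if m:
            # New channel adapter: per-port fields start again below it.
            ca = m.group(1)
            port = None
            continue
        m = re.match(r"Port (\d+):$", line)
        if m:
            port = {"ca": ca, "port": int(m.group(1))}
            ports.append(port)
            continue
        if port is not None and ":" in line:
            # Plain "Key: value" field belonging to the current port.
            key, _, value = line.partition(":")
            port[key.strip()] = value.strip()
    return ports


def active_ports_without_sm(ports):
    """Active ports whose SM lid is still 0 (hypothetical helper)."""
    return [p for p in ports
            if p.get("State") == "Active" and p.get("SM lid") == "0"]


if __name__ == "__main__":
    for p in parse_ibstat(SAMPLE):
        print(p["ca"], p["port"], p.get("State"), "SM lid", p.get("SM lid"))
```

Feeding it the ibstat output saved from each node gives a quick way to spot an Active port that never had its SM lid set, which is the case the question "Is it set on the other nodes?" is trying to rule out. It says nothing about whether ib0 itself is up.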
