On Fri, 3 Sep 2010 14:04:37 -0700
Chuck Hartley <[email protected]> wrote:

> I checked another working fabric here and see the same warnings,
> so it looks like the warnings are not really a problem.

Yes, I think you should consider those warnings, not errors.

> 
> Well, I assume that it is just IPoIB that isn't working. Since ibping
> works, I believe the IB layer itself is OK. Of course, I can't run
> any of the perftools, since they all need IPoIB to resolve the host IP.
> 
> Do you have any suggestions of what to check to diagnose the IPoIB
> problem?

Can you log into the nodes or do you have console output?  Is ib0 up?
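If you can log in, one way to answer this (and the "is the SM lid set on the other nodes?" question further down the thread) is to read the kernel's sysfs attributes on each node. The script is a minimal, hypothetical helper, not an OFED tool. It assumes Python 3, the standard Linux sysfs layout (/sys/class/infiniband/<device>/ports/<n>/ with state, lid and sm_lid files, plus /sys/class/net/<ifname>/operstate), and an IPoIB interface named ib0. The function names are made up for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical helper: report IB port state/SM LID and IPoIB link state from sysfs."""
from pathlib import Path
from typing import Dict, List


def read_attr(path: Path) -> str:
    """Return a sysfs attribute's stripped contents, or '' if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return ""


def ib_port_summary(sysfs_root: str = "/sys") -> List[Dict[str, object]]:
    """Collect state, LID and SM LID for every port of every IB device."""
    summary: List[Dict[str, object]] = []
    ib_dir = Path(sysfs_root) / "class" / "infiniband"
    if not ib_dir.is_dir():
        return summary  # no IB devices registered (drivers not loaded?)
    for dev in sorted(ib_dir.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        # Port directories are named "1", "2", ...; sort them numerically.
        for port in sorted(ports_dir.iterdir(), key=lambda p: int(p.name)):
            summary.append({
                "device": dev.name,
                "port": int(port.name),
                "state": read_attr(port / "state"),    # e.g. "4: ACTIVE"
                "lid": read_attr(port / "lid"),        # e.g. "0x6"
                "sm_lid": read_attr(port / "sm_lid"),  # "0x0" means no SM LID assigned
            })
    return summary


def sm_lid_assigned(port: Dict[str, object]) -> bool:
    """True if the port reports a non-zero SM LID."""
    return port["sm_lid"] not in ("", "0x0")


def netdev_operstate(ifname: str = "ib0", sysfs_root: str = "/sys") -> str:
    """Return the kernel's operstate ('up', 'down', ...) for a net device, or ''."""
    return read_attr(Path(sysfs_root) / "class" / "net" / ifname / "operstate")


if __name__ == "__main__":
    for p in ib_port_summary():
        note = "" if sm_lid_assigned(p) else "  <-- no SM LID"
        print(f"{p['device']} port {p['port']}: {p['state']} "
              f"lid={p['lid']} sm_lid={p['sm_lid']}{note}")
    print("ib0 operstate:", netdev_operstate("ib0") or "(no ib0 interface)")
```

Run it on each node. A port in the DOWN state reports sm_lid 0x0 regardless (as mlx4_0 port 2 does in the ibstat output quoted further down), so the interesting cases are an ACTIVE port without an SM LID, or an ib0 that is missing or not "up".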

Ira

> Specifically, can you think of any interaction with the
> "normal" networking stuff in the kernel that might be misconfigured?
> I mention that because I rebuilt/installed OFED (no
> errors/warnings) and it is in its default configuration, which is
> running well on other similar fabrics here. Therefore, I assume the
> problem must be with the non-OFED stuff. Previously, whenever this
> kind of problem has cropped up, it was because opensm was not
> running. I did check that iptables was off, so it isn't a firewall
> issue.
> 
> - Chuck
> 
> 
> On Thu, Sep 2, 2010 at 4:16 PM, Ira Weiny <[email protected]> wrote:
> > On Thu, 2 Sep 2010 11:11:13 -0700
> > Chuck Hartley <[email protected]> wrote:
> >
> >> Sure, here is the output:
> >> Note this is with the switch we swapped in, so the port numbers don't
> >> match the ibchecknet output in the original message.
> >>
> >> # ibstat
> >> CA 'mlx4_0'
> >>       CA type: MT26428
> >>       Number of ports: 2
> >>       Firmware version: 2.6.0
> >>       Hardware version: a0
> >>       Node GUID: 0x0002c90300032de0
> >>       System image GUID: 0x0002c90300032de3
> >>       Port 1:
> >>               State: Active
> >>               Physical state: LinkUp
> >>               Rate: 40
> >>               Base lid: 6
> >>               LMC: 0
> >>               SM lid: 6
> >
> > Well, the SM lid is set here.  Is it set on the other nodes?
> >
> > I don't usually run ibchecknet, but I get the same errors here on a
> > working fabric...
> >
> > ibwarn: [13629] dump_perfcounters: PortXmitWait not indicated so ignore this counter
> > #warn: Lid is not configured lid 37 port 2
> > #warn: SM Lid is not configured
> > Port check lid 37 port 2:  FAILED
> >
> > Looking at this output, I don't think this is an error.
> >
> > 13:17:14 > smpquery nodeinfo 37
> > # Node info: Lid 37
> > BaseVers:........................1
> > ClassVers:.......................1
> > NodeType:........................Switch
> > NumPorts:........................24
> > ...
> >
> > On switch external ports, the LID and SM LID are not used.
> >
> > Hal, would you concur?
> >
> > Chuck,
> > Is it just that IPoIB is not working for you?
> >
> > Ira
> >
> >
> >>               Capability mask: 0x0251086a
> >>               Port GUID: 0x0002c90300032de1
> >>       Port 2:
> >>               State: Down
> >>               Physical state: Polling
> >>               Rate: 10
> >>               Base lid: 0
> >>               LMC: 0
> >>               SM lid: 0
> >>               Capability mask: 0x02510868
> >>               Port GUID: 0x0002c90300032de2
> >> CA 'mthca0'
> >>       CA type: MT25204
> >>       Number of ports: 1
> >>       Firmware version: 1.2.0
> >>       Hardware version: a0
> >>       Node GUID: 0x003048c64c0c0000
> >>       System image GUID: 0x003048c64c0c0003
> >>       Port 1:
> >>               State: Down
> >>               Physical state: Polling
> >>               Rate: 10
> >>               Base lid: 0
> >>               LMC: 0
> >>               SM lid: 0
> >>               Capability mask: 0x02510a68
> >>               Port GUID: 0x003048c64c0c0001
> >>
> >> # iblinkinfo
> >> Switch 0x0002c9020041a7a0 Infiniscale-IV Mellanox Technologies:
> >>            1    1[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       5
> >> 1[  ] " HCA-1" ( )
> >>            1    2[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       6
> >> 1[  ] "linux70 HCA-1" ( )
> >>            1    3[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       7
> >> 1[  ] "linux71 HCA-1" ( )
> >>            1    4[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1    5[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1    6[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1    7[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1    8[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1    9[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   10[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   11[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   12[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   13[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   14[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   15[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   16[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   17[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   18[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   19[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   20[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   21[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   22[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   23[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   24[  ] ==( 4X 5.0 Gbps Active/  LinkUp)==>       9
> >> 1[  ] " HCA-1" ( )
> >>            1   25[  ] ==( 4X 5.0 Gbps Active/  LinkUp)==>       8
> >> 1[  ] " HCA-1" ( )
> >>            1   26[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   27[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   28[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   29[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   30[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   31[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   32[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   33[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   34[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   35[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>            1   36[  ] ==( 4X 2.5 Gbps   Down/ Polling)==>
> >> [  ] "" ( )
> >>
> >> On Thu, Sep 2, 2010 at 12:03 PM, Ira Weiny <[email protected]> wrote:
> >> > On Thu, 2 Sep 2010 06:56:50 -0700
> >> > Chuck Hartley <[email protected]> wrote:
> >> >
> >> >> We swapped in a different switch and see the same errors. The opensm
> >> >> logfile does not show any errors:
> >> >
> >> > Could you run "ibstat" on the node with OpenSM running?
> >> >
> >> > And "iblinkinfo" on the same node?
> >> >
> >> > Send that output.
> >> >
> >> > Ira
> >> >
> >> >>
> >> >> -------------------------------------------------
> >> >> OpenSM 3.3.5
> >> >> Command Line Arguments:
> >> >>  Daemon mode
> >> >>  Log File: /var/log/opensm.log
> >> >> -------------------------------------------------
> >> >> OpenSM 3.3.5
> >> >>
> >> >> Sep 02 05:56:29 933684 [B53B8700] 0x80 -> OpenSM 3.3.5
> >> >> Entering DISCOVERING state
> >> >>
> > >> >> Sep 02 05:56:29 934931 [B53B8700] 0x02 -> osm_vendor_init: 1000 pending umads specified
> >> >> Sep 02 05:56:29 935079 [B53B8700] 0x80 -> Entering DISCOVERING state
> >> >> Using default GUID 0x2c90300032de1
> >> >> Entering MASTER state
> >> >>
> > >> >> Sep 02 05:56:29 953763 [B53B8700] 0x02 -> osm_vendor_bind: Binding to port 0x2c90300032de1
> > >> >> Sep 02 05:56:29 990146 [B53B8700] 0x02 -> osm_vendor_bind: Binding to port 0x2c90300032de1
> > >> >> Sep 02 05:56:29 990240 [B53B8700] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0002c90300032de1
> >> >> Sep 02 05:56:30 009040 [AF1DB710] 0x80 -> Entering MASTER state
> >> >> SUBNET UP
> >> >>
> > >> >> Sep 02 05:56:30 009885 [AF1DB710] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
> >> >> Sep 02 05:56:30 014593 [AF1DB710] 0x80 -> SUBNET UP
> >> >>
> >> >>
> > >> >> On Thu, Sep 2, 2010 at 8:56 AM, Hal Rosenstock <[email protected]> wrote:
> > >> >> > On Thu, Sep 2, 2010 at 8:34 AM, Chuck Hartley <[email protected]> wrote:
> >> >> >> Hello,
> >> >> >>
> > >> >> >> We installed 1.5.1 and are having problems getting the IB fabric
> > >> >> >> working. ibv_devinfo shows the HCAs' ports are OK and ibdiagnet reports
> > >> >> >> no errors. However, ibchecknet shows that the switch ports are not
> > >> >> >> being configured. We have never seen this before and are at a loss as
> > >> >> >> to where the problem might be. Would someone please point us in the
> > >> >> >> right direction? Could it be a problem with the switch
> > >> >> >> itself? Output from ibchecknet is below.
> >> >> >>
> >> >> >>
> >> >> >> # ibchecknet
> > >> >> >> Error check on lid 3 (Infiniscale-IV Mellanox Technologies) port all:  FAILED
> > >> >> >> ibwarn: [26732] dump_perfcounters: PortXmitWait not indicated so ignore this counter
> >> >> >> #warn: Lid is not configured lid 3 port 7
> >> >> >> #warn: SM Lid is not configured
> >> >> >
> > >> >> > Is there an SM running on your subnet? If not, I think that the lack
> > >> >> > of an SM could account for all of the issues mentioned here.
> >> >> >
> >> >> > -- Hal
> >> >> >
> >> >> --
> >> >> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> >> >> the body of a message to [email protected]
> > >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >>
> >> >
> >> >
> >> > --
> >> > Ira Weiny
> >> > Math Programmer/Computer Scientist
> >> > Lawrence Livermore National Lab
> >> > 925-423-8008
> >> > [email protected]
> >> >
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> >> the body of a message to [email protected]
> >> More majordomo info at  http://**vger.kernel.org/majordomo-info.html
> >>
> >
> >
> >
> 

