Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build

2010-02-21 Thread Eli Cohen
There was a bug in build_mlx_header() that I committed on Thursday: it
failed to edit the MAD correctly. It has been fixed in the
latest build. Sorry for any inconvenience.

On Fri, Feb 19, 2010 at 03:02:07PM -0800, Woodruff, Robert J wrote:
 
 I have 2 systems that have Mellanox dual-port ConnectX
 cards with one of the ports connected to an SDR switch
 and the other port directly connected.
 
 With today's OFED-1.5.1 daily build, the OpenSM does
 not seem to transition the port all the way up.
 If I use OFED-1.5.1-rc1, it works fine.
 
 [r...@woody-10 woody]# /etc/init.d/opensmd start
 Starting IB Subnet Manager.[  OK  ]
 [r...@woody-10 woody]# /usr/sbin/ibstat
 CA 'mlx4_0'
 CA type: MT26428
 Number of ports: 2
 Firmware version: 2.7.0
 Hardware version: a0
 Node GUID: 0x0002c90300044fa8
 System image GUID: 0x0002c90300044fab
 Port 1:
 State: Armed
 Physical state: LinkUp
 Rate: 10
 Base lid: 1
 LMC: 0
 SM lid: 1
 Capability mask: 0x0251086a
 Port GUID: 0x0002c90300044fa9
 Port 2:
 State: Initializing
 Physical state: LinkUp
 Rate: 40
 Base lid: 0
 LMC: 0
 SM lid: 0
 Capability mask: 0x02510868
 Port GUID: 0x0002c90300044faa
 [r...@woody-10 woody]# 
 ___
 ewg mailing list
 ewg@lists.openfabrics.org
 http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] OpenSM problem on today's OFED-1.5.1 daily build

2010-02-19 Thread Woodruff, Robert J

I have 2 systems that have Mellanox dual-port ConnectX
cards with one of the ports connected to an SDR switch
and the other port directly connected.

With today's OFED-1.5.1 daily build, the OpenSM does
not seem to transition the port all the way up.
If I use OFED-1.5.1-rc1, it works fine.

[r...@woody-10 woody]# /etc/init.d/opensmd start
Starting IB Subnet Manager.                                [  OK  ]
[r...@woody-10 woody]# /usr/sbin/ibstat
CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 2
        Firmware version: 2.7.0
        Hardware version: a0
        Node GUID: 0x0002c90300044fa8
        System image GUID: 0x0002c90300044fab
        Port 1:
                State: Armed
                Physical state: LinkUp
                Rate: 10
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251086a
                Port GUID: 0x0002c90300044fa9
        Port 2:
                State: Initializing
                Physical state: LinkUp
                Rate: 40
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02510868
                Port GUID: 0x0002c90300044faa
[r...@woody-10 woody]#
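
For anyone who wants to check port states from a script, here is a minimal Python sketch that parses ibstat-style text like the output above and lists the ports that are not Active. The function names (parse_ibstat, ports_not_active) and the embedded sample are assumptions made for illustration; they are not part of OFED. The sketch only relies on the "CA '...'", "Port N:" and "Field: value" line shapes seen in this message.

```python
import re

# Abbreviated sample in the shape of the ibstat output posted in this thread.
SAMPLE_IBSTAT = """\
CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 2
        Port 1:
                State: Armed
                Physical state: LinkUp
                Rate: 10
                Base lid: 1
        Port 2:
                State: Initializing
                Physical state: LinkUp
                Rate: 40
                Base lid: 0
"""


def parse_ibstat(text):
    """Return {(ca_name, port_number): {field: value}} from ibstat-style text."""
    ports = {}
    ca = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()  # indentation is ignored, so flattened output also works
        m = re.match(r"CA '([^']+)'$", line)
        if m:
            ca, current = m.group(1), None
            continue
        m = re.match(r"Port (\d+):$", line)
        if m:
            current = ports.setdefault((ca, int(m.group(1))), {})
            continue
        # Per-port fields; CA-level fields before the first port are skipped.
        if current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return ports


def ports_not_active(ports):
    """Return {(ca_name, port_number): state} for ports whose State is not Active."""
    return {k: v.get("State") for k, v in ports.items() if v.get("State") != "Active"}


if __name__ == "__main__":
    stuck = ports_not_active(parse_ibstat(SAMPLE_IBSTAT))
    for (ca, port), state in sorted(stuck.items()):
        print(f"{ca} port {port}: {state}")
```

With the sample above, both ports are reported: port 1 is stuck in Armed and port 2 in Initializing.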


Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build

2010-02-19 Thread Hal Rosenstock
On Fri, Feb 19, 2010 at 6:02 PM, Woodruff, Robert J
robert.j.woodr...@intel.com wrote:

 I have 2 systems that have Mellanox dual-port ConnectX
 cards with one of the ports connected to an SDR switch
 and the other port directly connected.

 With today's OFED-1.5.1 daily build, the OpenSM does
 not seem to transition the port all the way up.
 If I use OFED-1.5.1-rc1, it works fine.

Has there been any change between those two in the management space ?


 [r...@woody-10 woody]# /etc/init.d/opensmd start
 Starting IB Subnet Manager.                                [  OK  ]

Based on the below, I'm presuming OpenSM runs on port 1.

 [r...@woody-10 woody]# /usr/sbin/ibstat
 CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 2
        Firmware version: 2.7.0
        Hardware version: a0
        Node GUID: 0x0002c90300044fa8
        System image GUID: 0x0002c90300044fab
        Port 1:
                State: Armed
                Physical state: LinkUp
                Rate: 10
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251086a
                Port GUID: 0x0002c90300044fa9

What state is the peer port in ? Any interesting OpenSM log messages ?

-- Hal

        Port 2:
                State: Initializing
                Physical state: LinkUp
                Rate: 40
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02510868
                Port GUID: 0x0002c90300044faa
 [r...@woody-10 woody]#

Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build

2010-02-19 Thread Woodruff, Robert J
Hal wrote,

Has there been any change between those two in the management space ?

I am not sure on that, but there must be some changes because it
works with RC1 but fails with today's daily build.


What state is the peer port in ? Any interesting OpenSM log messages ?

The peer port on the other node is in the Initializing state.

Here is the tail of the opensm log file.


Feb 19 15:44:23 734840 [1C05CA90] 0x80 - Entering DISCOVERING state
Feb 19 15:44:23 746070 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 0x2c90300044fa9
Feb 19 15:44:23 773455 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 0x2c90300044fa9
Feb 19 15:44:23 773501 [1C05CA90] 0x02 - osm_opensm_bind: Setting IS_SM on port 0x0002c90300044fa9
Feb 19 15:44:24 574767 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0x14123b, Hop Ptr: 0x0
Feb 19 15:44:24 574798 [41A72940] 0x01 - Received SMP on a 1 hop path: Initial path = 0,0, Return path  = 0,0
Feb 19 15:44:24 574811 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x123b
Using default GUID 0x2c90300044fa9
Entering MASTER state

Feb 19 15:44:24 574879 [595F1940] 0x80 - Entering MASTER state
SUBNET UP

Feb 19 15:44:24 576233 [595F1940] 0x80 - SUBNET UP
Feb 19 15:44:34 538093 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0x141240, Hop Ptr: 0x0
Feb 19 15:44:34 538114 [41A72940] 0x01 - Received SMP on a 1 hop path: Initial path = 0,0, Return path  = 0,0
Feb 19 15:44:34 538123 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1240
Feb 19 15:44:34 538853 [595F1940] 0x02 - SUBNET UP
Feb 19 15:44:44 541415 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0x141244, Hop Ptr: 0x0
Feb 19 15:44:44 541434 [41A72940] 0x01 - Received SMP on a 1 hop path: Initial path = 0,0, Return path  = 0,0
Feb 19 15:44:44 541442 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1244
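
To make the repeating pattern in this log easier to see, here is a rough Python sketch that counts the ERR codes and lists which requests timed out. It assumes each log entry sits on a single line (mail wrapping has to be undone first). The helper names count_errors and timed_out_requests are made up for this example, and the regular expressions only match the line shapes visible in this excerpt, not the full OpenSM log format.

```python
import re
from collections import Counter

# A few entries copied from the log tail above, one entry per line.
SAMPLE_LOG = """\
Feb 19 15:44:24 574767 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping
Feb 19 15:44:24 574811 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x123b
Feb 19 15:44:24 576233 [595F1940] 0x80 - SUBNET UP
Feb 19 15:44:34 538123 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1240
"""

# "<function>: ERR <code>: <message>" following the " - " separator.
ERR_RE = re.compile(r"- (\w+): ERR (\d+): (.*)$")
# Request name and transaction ID on IB_TIMEOUT lines.
TIMEOUT_RE = re.compile(r"\(IB_TIMEOUT\): (\S+), .*TID (0x[0-9a-fA-F]+)")


def count_errors(log_text):
    """Count log entries per (reporting function, ERR code)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = ERR_RE.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


def timed_out_requests(log_text):
    """Return a list of (request, TID) pairs for entries that report IB_TIMEOUT."""
    matches = (TIMEOUT_RE.search(line) for line in log_text.splitlines())
    return [m.groups() for m in matches if m]


if __name__ == "__main__":
    for (func, code), n in sorted(count_errors(SAMPLE_LOG).items()):
        print(f"ERR {code} from {func}: {n}")
    for request, tid in timed_out_requests(SAMPLE_LOG):
        print(f"timed out: {request} TID {tid}")
```

In this excerpt every timeout is a SubnGet(NodeInfo), repeating roughly every 10 seconds.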


Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build

2010-02-19 Thread Hal Rosenstock
On Fri, Feb 19, 2010 at 6:47 PM, Woodruff, Robert J
robert.j.woodr...@intel.com wrote:
 Hal wrote,

Has there been any change between those two in the management space ?

 I am not sure on that, but there must be some changes because it
 works with RC1 but fails with today's daily build.

Could it be changes to mlx driver ?

What state is the peer port in ? Any interesting OpenSM log messages ?

 The peer port on the other node is in the Initializing state.

And that's an SDR switch port ?

 Here is the tail of the opensm log file.


 Feb 19 15:44:23 734840 [1C05CA90] 0x80 - Entering DISCOVERING state
 Feb 19 15:44:23 746070 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 
 0x2c90300044fa9
 Feb 19 15:44:23 773455 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 
 0x2c90300044fa9
 Feb 19 15:44:23 773501 [1C05CA90] 0x02 - osm_opensm_bind: Setting IS_SM on 
 port 0x0002c90300044fa9
 Feb 19 15:44:24 574767 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP 
 Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0x14123b, Hop Ptr: 0x0
 Feb 19 15:44:24 574798 [41A72940] 0x01 - Received SMP on a 1 hop path: 
 Initial path = 0,0, Return path  = 0,0
 Feb 19 15:44:24 574811 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: 
 MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 
 0x123b
 Using default GUID 0x2c90300044fa9
 Entering MASTER state

 Feb 19 15:44:24 574879 [595F1940] 0x80 - Entering MASTER state
 SUBNET UP

 Feb 19 15:44:24 576233 [595F1940] 0x80 - SUBNET UP
 Feb 19 15:44:34 538093 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP 
 Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0x141240, Hop Ptr: 0x0
 Feb 19 15:44:34 538114 [41A72940] 0x01 - Received SMP on a 1 hop path: 
 Initial path = 0,0, Return path  = 0,0
 Feb 19 15:44:34 538123 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: 
 MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 
 0x1240
 Feb 19 15:44:34 538853 [595F1940] 0x02 - SUBNET UP
 Feb 19 15:44:44 541415 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP 
 Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0x141244, Hop Ptr: 0x0
 Feb 19 15:44:44 541434 [41A72940] 0x01 - Received SMP on a 1 hop path: 
 Initial path = 0,0, Return path  = 0,0
 Feb 19 15:44:44 541442 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: 
 MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 
 0x1244

Looks like the switch SMA is not responding ? Can you try some smpquerys to it ?

Is this reproducible in this environment ?

Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build

2010-02-19 Thread Hal Rosenstock
On Fri, Feb 19, 2010 at 7:08 PM, Hal Rosenstock
hal.rosenst...@gmail.com wrote:
 On Fri, Feb 19, 2010 at 6:47 PM, Woodruff, Robert J
 robert.j.woodr...@intel.com wrote:
 Hal wrote,

Has there been any change between those two in the management space ?

 I am not sure on that, but there must be some changes because it
 works with RC1 but fails with today's daily build.

 Could it be changes to mlx driver ?

What state is the peer port in ? Any interesting OpenSM log messages ?

 The peer port on the other node is in the Initializing state.

 And that's an SDR switch port ?

 Here is the tail of the opensm log file.


 Feb 19 15:44:23 734840 [1C05CA90] 0x80 - Entering DISCOVERING state
 Feb 19 15:44:23 746070 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 
 0x2c90300044fa9
 Feb 19 15:44:23 773455 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 
 0x2c90300044fa9
 Feb 19 15:44:23 773501 [1C05CA90] 0x02 - osm_opensm_bind: Setting IS_SM on 
 port 0x0002c90300044fa9
 Feb 19 15:44:24 574767 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP 
 Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0x14123b, Hop Ptr: 0x0
 Feb 19 15:44:24 574798 [41A72940] 0x01 - Received SMP on a 1 hop path: 
 Initial path = 0,0, Return path  = 0,0
 Feb 19 15:44:24 574811 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: 
 MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 
 0x123b
 Using default GUID 0x2c90300044fa9
 Entering MASTER state

 Feb 19 15:44:24 574879 [595F1940] 0x80 - Entering MASTER state
 SUBNET UP

 Feb 19 15:44:24 576233 [595F1940] 0x80 - SUBNET UP
 Feb 19 15:44:34 538093 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP 
 Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0x141240, Hop Ptr: 0x0
 Feb 19 15:44:34 538114 [41A72940] 0x01 - Received SMP on a 1 hop path: 
 Initial path = 0,0, Return path  = 0,0
 Feb 19 15:44:34 538123 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: 
 MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 
 0x1240
 Feb 19 15:44:34 538853 [595F1940] 0x02 - SUBNET UP
 Feb 19 15:44:44 541415 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP 
 Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0x141244, Hop Ptr: 0x0
 Feb 19 15:44:44 541434 [41A72940] 0x01 - Received SMP on a 1 hop path: 
 Initial path = 0,0, Return path  = 0,0
 Feb 19 15:44:44 541442 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: 
 MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 
 0x1244

 Looks like the switch SMA is not responding ? Can you try some smpquerys to 
 it ?

Also, try rebooting that switch.


 Is this reproducible in this environment ?


Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build

2010-02-19 Thread Hal Rosenstock
On Fri, Feb 19, 2010 at 7:16 PM, Woodruff, Robert J
robert.j.woodr...@intel.com wrote:
 Hal wrote,

Could it be changes to mlx driver ?

 Guess we need to look at what has changed since RC1.

And that's an SDR switch port ?

 Yes, this is a very very old 8 port Mellanox SDR switch.

Looks like the switch SMA is not responding ? Can you try some smpquerys to 
it ?

 I re-loaded the OFED-1.5.1-rc1 code, it seems to work fine, so I do not 
 suspect the switch,

Makes sense.

 unless the latest OpenSM or MLX driver is sending some MAD to the switch SMA 
 that it
 does not understand.

Is this reproducible in this environment ?

 Yes, it happens every time.

Can you run OFED-1.5.1-rc1 with the OpenSM from the failing daily
build ? I suspect that will work and would show it's mlx4 as opposed
to management code but maybe I'll eat my words.


Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build

2010-02-19 Thread Woodruff, Robert J
Hal wrote,

Can you run OFED-1.5.1-rc1 with the OpenSM from the failing daily
build ? I suspect that will work and would show it's mlx4 as opposed
to management code but maybe I'll eat my words. 


I tried using the OpenSM from today's daily build on the core and driver
from RC1 and it seems to work OK.

Also, I was wrong when I said that port 1 was connected to an old SDR
switch; my bad. In fact, these 2 systems are directly connected. Perhaps
the SMA in the driver got broken between RC1 and today's build?


Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build

2010-02-19 Thread Hal Rosenstock
On Fri, Feb 19, 2010 at 7:49 PM, Woodruff, Robert J
robert.j.woodr...@intel.com wrote:
 Hal wrote,

Can you run OFED-1.5.1-rc1 with the OpenSM from the failing daily
build ? I suspect that will work and would show it's mlx4 as opposed
to management code but maybe I'll eat my words.


 I tried using the OpenSM from today's daily build on the core and driver
 from RC1 and it seems to work OK.

 Also, I was wrong when I said that port 1 was connected to an old SDR
 switch, my bad, in fact these 2 systems are direct connected. Perhaps
 the SMA in the driver got broken between RC1 and today's build ?

That's consistent with the timeouts in the OpenSM log. The problem is
likely in the mlx4-specific part of the SMA, as I think there are
changes going on there (and not in the core SMA). Best to ask Mellanox
about this.
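
The component-swap reasoning in this thread can be written down as a small Python sketch. It records the three combinations reported so far and keeps a component as a suspect only if no run using its daily-build version worked. The names (COMPONENTS, OBSERVATIONS, suspects) are illustrative, and the sketch assumes the fault lives in exactly one component.

```python
# Components that were swapped between OFED-1.5.1-rc1 and the daily build.
COMPONENTS = ("opensm", "core_and_driver")

# (configuration, did the port come up?) as reported in this thread.
OBSERVATIONS = [
    ({"opensm": "rc1", "core_and_driver": "rc1"}, True),
    ({"opensm": "daily", "core_and_driver": "daily"}, False),
    ({"opensm": "daily", "core_and_driver": "rc1"}, True),
]


def suspects(observations, new_build="daily"):
    """Return components whose new build never appears in a working run."""
    remaining = []
    for component in COMPONENTS:
        # Outcomes of every run that used the new build of this component.
        runs = [ok for config, ok in observations if config[component] == new_build]
        # A single working run with the new build clears the component.
        if runs and not any(runs):
            remaining.append(component)
    return remaining


if __name__ == "__main__":
    print(suspects(OBSERVATIONS))
```

Here the daily OpenSM is cleared by the run where it worked on top of the RC1 core and driver, which leaves the core/driver side as the only suspect.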