Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build
There was a bug in build_mlx_header() that I committed on Thursday: it failed to edit the MAD correctly. It has been fixed in the latest build. Sorry for any inconvenience.

On Fri, Feb 19, 2010 at 03:02:07PM -0800, Woodruff, Robert J wrote:
> I have 2 systems that have Mellanox dual-port ConnectX cards with one of the ports connected to an SDR switch and the other port directly connected.
> With today's OFED-1.5.1 daily build, OpenSM does not seem to transition the port all the way up. If I use OFED-1.5.1-rc1, it works fine.
>
> [r...@woody-10 woody]# /etc/init.d/opensmd start
> Starting IB Subnet Manager. [ OK ]
> [r...@woody-10 woody]# /usr/sbin/ibstat
> CA 'mlx4_0'
>     CA type: MT26428
>     Number of ports: 2
>     Firmware version: 2.7.0
>     Hardware version: a0
>     Node GUID: 0x0002c90300044fa8
>     System image GUID: 0x0002c90300044fab
>     Port 1:
>         State: Armed
>         Physical state: LinkUp
>         Rate: 10
>         Base lid: 1
>         LMC: 0
>         SM lid: 1
>         Capability mask: 0x0251086a
>         Port GUID: 0x0002c90300044fa9
>     Port 2:
>         State: Initializing
>         Physical state: LinkUp
>         Rate: 40
>         Base lid: 0
>         LMC: 0
>         SM lid: 0
>         Capability mask: 0x02510868
>         Port GUID: 0x0002c90300044faa
> [r...@woody-10 woody]#

___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] OpenSM problem on today's OFED-1.5.1 daily build
I have 2 systems that have Mellanox dual-port ConnectX cards with one of the ports connected to an SDR switch and the other port directly connected.
With today's OFED-1.5.1 daily build, OpenSM does not seem to transition the port all the way up. If I use OFED-1.5.1-rc1, it works fine.

[r...@woody-10 woody]# /etc/init.d/opensmd start
Starting IB Subnet Manager. [ OK ]
[r...@woody-10 woody]# /usr/sbin/ibstat
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 2
    Firmware version: 2.7.0
    Hardware version: a0
    Node GUID: 0x0002c90300044fa8
    System image GUID: 0x0002c90300044fab
    Port 1:
        State: Armed
        Physical state: LinkUp
        Rate: 10
        Base lid: 1
        LMC: 0
        SM lid: 1
        Capability mask: 0x0251086a
        Port GUID: 0x0002c90300044fa9
    Port 2:
        State: Initializing
        Physical state: LinkUp
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510868
        Port GUID: 0x0002c90300044faa
[r...@woody-10 woody]#
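The symptom in the report is that neither port reaches the Active state. As a rough illustration, the ibstat text above can be checked mechanically. This is only a sketch: the helper names `port_fields` and `not_active` are made up for this example, and the parser assumes the simple "Key: value" layout shown in the report rather than any documented ibstat format.

```python
import re

# Sample text copied from the report; on a live system it would come from
# running /usr/sbin/ibstat.
IBSTAT_OUTPUT = """\
CA 'mlx4_0'
    CA type: MT26428
    Number of ports: 2
    Firmware version: 2.7.0
    Hardware version: a0
    Node GUID: 0x0002c90300044fa8
    System image GUID: 0x0002c90300044fab
    Port 1:
        State: Armed
        Physical state: LinkUp
        Rate: 10
        Base lid: 1
        LMC: 0
        SM lid: 1
        Capability mask: 0x0251086a
        Port GUID: 0x0002c90300044fa9
    Port 2:
        State: Initializing
        Physical state: LinkUp
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x02510868
        Port GUID: 0x0002c90300044faa
"""


def port_fields(text):
    """Return {port_number: {field: value}} for each 'Port N:' section."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"Port (\d+):", line)
        if header:
            current = int(header.group(1))
            ports[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            ports[current][key.strip()] = value.strip()
    return ports


def not_active(text):
    """Return {port_number: state} for ports whose logical state is not Active."""
    return {
        number: fields.get("State")
        for number, fields in port_fields(text).items()
        if fields.get("State") != "Active"
    }


print(not_active(IBSTAT_OUTPUT))
```

Both ports show Physical state LinkUp, so the link itself trained; it is the logical state that is stuck short of Active.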
Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build
On Fri, Feb 19, 2010 at 6:02 PM, Woodruff, Robert J robert.j.woodr...@intel.com wrote:
> I have 2 systems that have Mellanox dual-port ConnectX cards with one of the ports connected to an SDR switch and the other port directly connected.
> With today's OFED-1.5.1 daily build, OpenSM does not seem to transition the port all the way up. If I use OFED-1.5.1-rc1, it works fine.

Has there been any change between those two in the management space?

> [r...@woody-10 woody]# /etc/init.d/opensmd start
> Starting IB Subnet Manager. [ OK ]

Based on the below, I'm presuming OpenSM runs on port 1.

> [r...@woody-10 woody]# /usr/sbin/ibstat
> CA 'mlx4_0'
>     CA type: MT26428
>     Number of ports: 2
>     Firmware version: 2.7.0
>     Hardware version: a0
>     Node GUID: 0x0002c90300044fa8
>     System image GUID: 0x0002c90300044fab
>     Port 1:
>         State: Armed
>         Physical state: LinkUp
>         Rate: 10
>         Base lid: 1
>         LMC: 0
>         SM lid: 1
>         Capability mask: 0x0251086a
>         Port GUID: 0x0002c90300044fa9

What state is the peer port in? Any interesting OpenSM log messages?

-- Hal

>     Port 2:
>         State: Initializing
>         Physical state: LinkUp
>         Rate: 40
>         Base lid: 0
>         LMC: 0
>         SM lid: 0
>         Capability mask: 0x02510868
>         Port GUID: 0x0002c90300044faa
> [r...@woody-10 woody]#
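One way to back up the presumption that OpenSM is bound to port 1 is to compare the two capability masks from the ibstat output. The sketch below assumes that IsSM is bit 1 of the PortInfo capability mask, which is how the Linux kernel defines IB_PORT_SM; the constant and helper names are made up for this example.

```python
# Capability masks copied from the ibstat output in the report.
PORT1_CAP_MASK = 0x0251086A
PORT2_CAP_MASK = 0x02510868

# Assumption: IsSM is bit 1 of the capability mask (IB_PORT_SM in the
# Linux kernel headers).
IS_SM_BIT = 1 << 1


def has_is_sm(mask):
    """True if the IsSM bit is set in a port capability mask."""
    return bool(mask & IS_SM_BIT)


# XOR leaves only the bits on which the two ports disagree.
difference = PORT1_CAP_MASK ^ PORT2_CAP_MASK

print(hex(difference))
print(has_is_sm(PORT1_CAP_MASK), has_is_sm(PORT2_CAP_MASK))
```

Under that assumption the only bit that differs between the two ports is IsSM, set on port 1, which is also the port that reports SM lid 1 as its own Base lid.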
Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build
Hal wrote,
> Has there been any change between those two in the management space?

I am not sure on that, but there must be some changes because it works with RC1 but fails with today's daily build.

> What state is the peer port in? Any interesting OpenSM log messages?

The peer port on the other node is in the Initializing state. Here is the tail of the opensm log file.

Feb 19 15:44:23 734840 [1C05CA90] 0x80 - Entering DISCOVERING state
Feb 19 15:44:23 746070 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 0x2c90300044fa9
Feb 19 15:44:23 773455 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 0x2c90300044fa9
Feb 19 15:44:23 773501 [1C05CA90] 0x02 - osm_opensm_bind: Setting IS_SM on port 0x0002c90300044fa9
Feb 19 15:44:24 574767 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping Method 0x1, Attr 0x11, TID 0x14123b, Hop Ptr: 0x0
Feb 19 15:44:24 574798 [41A72940] 0x01 - Received SMP on a 1 hop path: Initial path = 0,0, Return path = 0,0
Feb 19 15:44:24 574811 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x123b
Using default GUID 0x2c90300044fa9
Entering MASTER state
Feb 19 15:44:24 574879 [595F1940] 0x80 - Entering MASTER state
SUBNET UP
Feb 19 15:44:24 576233 [595F1940] 0x80 - SUBNET UP
Feb 19 15:44:34 538093 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping Method 0x1, Attr 0x11, TID 0x141240, Hop Ptr: 0x0
Feb 19 15:44:34 538114 [41A72940] 0x01 - Received SMP on a 1 hop path: Initial path = 0,0, Return path = 0,0
Feb 19 15:44:34 538123 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1240
Feb 19 15:44:34 538853 [595F1940] 0x02 - SUBNET UP
Feb 19 15:44:44 541415 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping Method 0x1, Attr 0x11, TID 0x141244, Hop Ptr: 0x0
Feb 19 15:44:44 541434 [41A72940] 0x01 - Received SMP on a 1 hop path: Initial path = 0,0, Return path = 0,0
Feb 19 15:44:44 541442 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1244
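The log tail repeats the same pair of errors on every sweep. A small tally makes that pattern easier to see. This is a sketch only: the regular expression is an assumption fitted to the lines quoted in this thread, not a documented OpenSM log grammar, and the sample holds just the first two rounds plus one non-error line.

```python
import re
from collections import Counter

# A few lines copied from the log tail in this thread.
LOG_TAIL = """\
Feb 19 15:44:24 574767 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping Method 0x1, Attr 0x11, TID 0x14123b, Hop Ptr: 0x0
Feb 19 15:44:24 574811 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x123b
Feb 19 15:44:24 576233 [595F1940] 0x80 - SUBNET UP
Feb 19 15:44:34 538093 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping Method 0x1, Attr 0x11, TID 0x141240, Hop Ptr: 0x0
Feb 19 15:44:34 538123 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1240
"""

# Matches "- <function>: ERR <4-character code>:" as it appears above.
ERR_RE = re.compile(r"- (\w+): ERR ([0-9A-Fa-f]{4}):")


def count_errors(text):
    """Count (function, error code) pairs found in OpenSM-style log text."""
    counts = Counter()
    for line in text.splitlines():
        match = ERR_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


for (function, code), n in sorted(count_errors(LOG_TAIL).items()):
    print(f"{function} ERR {code}: {n}")
```

Each ERR 5411 send failure is followed by an ERR 3113 timeout for SubnGet(NodeInfo), and in the full log the pair recurs roughly every ten seconds, so the same NodeInfo query is failing on every retry.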
Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build
On Fri, Feb 19, 2010 at 6:47 PM, Woodruff, Robert J robert.j.woodr...@intel.com wrote:
> Hal wrote,
>> Has there been any change between those two in the management space?
>
> I am not sure on that, but there must be some changes because it works with RC1 but fails with today's daily build.

Could it be changes to the mlx driver?

>> What state is the peer port in? Any interesting OpenSM log messages?
>
> The peer port on the other node is in the Initializing state.

And that's an SDR switch port?

> Here is the tail of the opensm log file.
>
> Feb 19 15:44:23 734840 [1C05CA90] 0x80 - Entering DISCOVERING state
> Feb 19 15:44:23 746070 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 0x2c90300044fa9
> Feb 19 15:44:23 773455 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 0x2c90300044fa9
> Feb 19 15:44:23 773501 [1C05CA90] 0x02 - osm_opensm_bind: Setting IS_SM on port 0x0002c90300044fa9
> Feb 19 15:44:24 574767 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping Method 0x1, Attr 0x11, TID 0x14123b, Hop Ptr: 0x0
> Feb 19 15:44:24 574798 [41A72940] 0x01 - Received SMP on a 1 hop path: Initial path = 0,0, Return path = 0,0
> Feb 19 15:44:24 574811 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x123b
> Using default GUID 0x2c90300044fa9
> Entering MASTER state
> Feb 19 15:44:24 574879 [595F1940] 0x80 - Entering MASTER state
> SUBNET UP
> Feb 19 15:44:24 576233 [595F1940] 0x80 - SUBNET UP
> Feb 19 15:44:34 538093 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping Method 0x1, Attr 0x11, TID 0x141240, Hop Ptr: 0x0
> Feb 19 15:44:34 538114 [41A72940] 0x01 - Received SMP on a 1 hop path: Initial path = 0,0, Return path = 0,0
> Feb 19 15:44:34 538123 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1240
> Feb 19 15:44:34 538853 [595F1940] 0x02 - SUBNET UP
> Feb 19 15:44:44 541415 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping Method 0x1, Attr 0x11, TID 0x141244, Hop Ptr: 0x0
> Feb 19 15:44:44 541434 [41A72940] 0x01 - Received SMP on a 1 hop path: Initial path = 0,0, Return path = 0,0
> Feb 19 15:44:44 541442 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1244

Looks like the switch SMA is not responding? Can you try some smpquery commands against it?

Is this reproducible in this environment?
Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build
On Fri, Feb 19, 2010 at 7:08 PM, Hal Rosenstock hal.rosenst...@gmail.com wrote:
> On Fri, Feb 19, 2010 at 6:47 PM, Woodruff, Robert J robert.j.woodr...@intel.com wrote:
>> Hal wrote,
>>> Has there been any change between those two in the management space?
>>
>> I am not sure on that, but there must be some changes because it works with RC1 but fails with today's daily build.
>
> Could it be changes to the mlx driver?
>
>>> What state is the peer port in? Any interesting OpenSM log messages?
>>
>> The peer port on the other node is in the Initializing state.
>
> And that's an SDR switch port?
>
>> Here is the tail of the opensm log file.
>>
>> Feb 19 15:44:23 734840 [1C05CA90] 0x80 - Entering DISCOVERING state
>> Feb 19 15:44:23 746070 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 0x2c90300044fa9
>> Feb 19 15:44:23 773455 [1C05CA90] 0x02 - osm_vendor_bind: Binding to port 0x2c90300044fa9
>> Feb 19 15:44:23 773501 [1C05CA90] 0x02 - osm_opensm_bind: Setting IS_SM on port 0x0002c90300044fa9
>> Feb 19 15:44:24 574767 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping Method 0x1, Attr 0x11, TID 0x14123b, Hop Ptr: 0x0
>> Feb 19 15:44:24 574798 [41A72940] 0x01 - Received SMP on a 1 hop path: Initial path = 0,0, Return path = 0,0
>> Feb 19 15:44:24 574811 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x123b
>> Using default GUID 0x2c90300044fa9
>> Entering MASTER state
>> Feb 19 15:44:24 574879 [595F1940] 0x80 - Entering MASTER state
>> SUBNET UP
>> Feb 19 15:44:24 576233 [595F1940] 0x80 - SUBNET UP
>> Feb 19 15:44:34 538093 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping Method 0x1, Attr 0x11, TID 0x141240, Hop Ptr: 0x0
>> Feb 19 15:44:34 538114 [41A72940] 0x01 - Received SMP on a 1 hop path: Initial path = 0,0, Return path = 0,0
>> Feb 19 15:44:34 538123 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1240
>> Feb 19 15:44:34 538853 [595F1940] 0x02 - SUBNET UP
>> Feb 19 15:44:44 541415 [41A72940] 0x01 - umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping Method 0x1, Attr 0x11, TID 0x141244, Hop Ptr: 0x0
>> Feb 19 15:44:44 541434 [41A72940] 0x01 - Received SMP on a 1 hop path: Initial path = 0,0, Return path = 0,0
>> Feb 19 15:44:44 541442 [41A72940] 0x01 - sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1244
>
> Looks like the switch SMA is not responding? Can you try some smpquery commands against it?

Also, try rebooting that switch.

> Is this reproducible in this environment?
Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build
On Fri, Feb 19, 2010 at 7:16 PM, Woodruff, Robert J robert.j.woodr...@intel.com wrote:
> Hal wrote,
>> Could it be changes to the mlx driver?
>
> Guess we need to look at what has changed since RC1.
>
>> And that's an SDR switch port?
>
> Yes, this is a very, very old 8-port Mellanox SDR switch.
>
>> Looks like the switch SMA is not responding? Can you try some smpquery commands against it?
>
> I re-loaded the OFED-1.5.1-rc1 code and it seems to work fine, so I do not suspect the switch,

Makes sense.

> unless the latest OpenSM or MLX driver is sending some MAD to the switch SMA that it does not understand.
>
>> Is this reproducible in this environment?
>
> Yes, it happens every time.

Can you run OFED-1.5.1-rc1 with the OpenSM from the failing daily build? I suspect that will work and would show it's mlx4 as opposed to management code, but maybe I'll eat my words.
Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build
Hal wrote,
> Can you run OFED-1.5.1-rc1 with the OpenSM from the failing daily build? I suspect that will work and would show it's mlx4 as opposed to management code, but maybe I'll eat my words.

I tried using the OpenSM from today's daily build on the core and driver from RC1 and it seems to work OK.
Also, I was wrong when I said that port 1 was connected to an old SDR switch, my bad; in fact these 2 systems are directly connected.

Perhaps the SMA in the driver got broken between RC1 and today's build?
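The cross-test above narrows the fault by changing one component at a time. As a sketch of that reasoning, the three configurations reported in this thread can be written down and compared pairwise. The component labels `opensm` and `core_and_driver` are illustrative names for this example, not OFED package names.

```python
# Each observation: which build each component came from, and whether the
# ports came up. Outcomes are the ones reported in this thread.
OBSERVATIONS = [
    ({"opensm": "rc1", "core_and_driver": "rc1"}, True),
    ({"opensm": "daily", "core_and_driver": "daily"}, False),
    ({"opensm": "daily", "core_and_driver": "rc1"}, True),
]


def suspects(observations):
    """Return components that alone distinguish a failing setup from a working one."""
    found = set()
    for bad, bad_ok in observations:
        if bad_ok:
            continue
        for good, good_ok in observations:
            if not good_ok:
                continue
            differing = [name for name in bad if bad[name] != good[name]]
            # A single differing component is the only possible cause
            # of the different outcome between these two setups.
            if len(differing) == 1:
                found.add(differing[0])
    return found


print(suspects(OBSERVATIONS))
```

With these three data points the daily OpenSM is cleared, because it works on the RC1 core and driver, and the remaining difference is the kernel side.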
Re: [ewg] OpenSM problem on today's OFED-1.5.1 daily build
On Fri, Feb 19, 2010 at 7:49 PM, Woodruff, Robert J robert.j.woodr...@intel.com wrote:
> Hal wrote,
>> Can you run OFED-1.5.1-rc1 with the OpenSM from the failing daily build? I suspect that will work and would show it's mlx4 as opposed to management code, but maybe I'll eat my words.
>
> I tried using the OpenSM from today's daily build on the core and driver from RC1 and it seems to work OK.
> Also, I was wrong when I said that port 1 was connected to an old SDR switch, my bad; in fact these 2 systems are directly connected.
>
> Perhaps the SMA in the driver got broken between RC1 and today's build?

That's consistent with the timeouts in the OpenSM log. The problem is likely in the mlx4-specific part of the SMA, as I think there are changes going on there (and not in the core SMA). Best to ask Mellanox about this.