Hi Jim,

On Wed, 2008-06-04 at 13:59 -0600, Jim Schutt wrote:
> Hi Hal,
>
> I've just discovered that what I thought was ofed-1.2 opensm
> is really ofed-1.2-rc2, if it matters.

I don't recall what the differences were but let's assume it's not
significant for now.

> Anyway:
>
> On Wed, 2008-06-04 at 13:27 -0600, Hal Rosenstock wrote:
> > Jim,
> >
> > On Wed, 2008-06-04 at 13:14 -0600, Jim Schutt wrote:
> > > On Tue, 2008-06-03 at 11:13 -0600, Hal Rosenstock wrote:
> > > > Steve,
> > > >
> > > > One more thought below...
> > > >
> > > > On Tue, 2008-06-03 at 09:49 -0700, Hal Rosenstock wrote:
> > > > > Steve,
> > > > >
> > > > > On Tue, 2008-06-03 at 11:19 -0500, Steve Wise wrote:
> > > > > > Hello opensm gurus:
> > > > > >
> > > > > > Sandia is seeing problems after migrating up to ofed-1.3. They are
> > > > > > still using an ofed-1.2 opensm but with ofed-1.3 clients, updated
> > > > > > from ofed-1.2.5.
> > > > > >
> > > > > Was the OpenSM node changed in some way or only the end nodes ?
> > >
> > > Only the end nodes were changed to ofed-1.3.
> > >
> > > > >
> > > > > > They are getting the errors below.
> > > > > >
> > > > > > Q: should this work? Or are there backwards compat issues?
> > > > >
> > > > > I haven't explicitly tried it but I would think it should work.
> > > > >
> > > > > The errors below are timeouts on switch MFT sets which are only
> > > > > indirectly related to the end nodes (in that the MC SA joins cause the
> > > > > MC routing and those tables to be set) so I don't see the relationship
> > > > > but might be missing something.
> > > > >
> > > > > -- Hal
> > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Steve.
> > > > > >
> > > > > >
> > > > Could this switch SMA be "stuck" ?
> > > >
> > > > Could you try smpquery -D nodeinfo 0,1,14,9
> > > > and
> > > > smpquery -D nodeinfo 0,1,14
> > > > from the SM node ?
> > > >
> > > > -- Hal
> > > >
> >
> > > So I've stopped the 1.3 opensm (which was not logging anything)
> >
> > Perhaps this OpenSM was not master.
> >
> > > and started up the 1.2 opensm on another node.
> > > After a short time (<1 min) it started logging these again:
> >
> > That's likely when it becomes master again.
> >
> > > ------
> > >
> > > Jun 04 11:51:13 063963 [45007960] -> umad_receiver: ERR 5409: send
> > > completed with error (method=0x2 attr=0x1B trans_id=0x23000406da) --
> > > dropping
> > > Jun 04 11:51:13 063973 [45007960] -> umad_receiver: ERR 5411: DR SMP Hop
> > > Ptr: 0x0
> > > Jun 04 11:51:13 063986 [45007960] -> Received SMP on a 3 hop path:
> > > Initial path = 0,0,0,0
> > > Return path = 0,0,0,0
> > > Jun 04 11:51:13 063996 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
> > > 3113: MAD completed in error (IB_TIMEOUT)
> > > Jun 04 11:51:13 064004 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
> > > 3119: Set method failed
> > > Jun 04 11:51:13 064034 [45007960] -> SMP dump:
> > > base_ver................0x1
> > > mgmt_class..............0x81
> > > class_ver...............0x1
> > > method..................0x2 (SubnSet)
> > > D bit...................0x0
> > > status..................0x0
> > > hop_ptr.................0x0
> > > hop_count...............0x3
> > > trans_id................0x406da
> > > attr_id.................0x1B
> > > (MulticastForwardingTable)
> > > resv....................0x0
> > > attr_mod................0x10000000
> > > m_key...................0x0000000000000000
> > > dr_slid.................0xFFFF
> > > dr_dlid.................0xFFFF
> > >
> > > Initial path: 0,1,13,8
> > > Return path: 0,0,0,0
> > > Reserved: [0][0][0][0][0][0][0]
> > >
> > > 00 40 00 40 00 00 00 00 00 00 00 00 00
> > > 00 00 00
> > >
> > > 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 40 00 00
> > >
> > > 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 00 00 00
> > >
> > > 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 00 00 00
> > >
> > > Jun 04 11:51:13 067533 [42803960] -> Errors during initialization
> > > Jun 04 11:51:13 067607 [42803960] -> __osm_state_mgr_init_errors_msg:
> > >
> > > ------
> > >
> > > I tried your smpquery suggestion from the node
> > > running the 1.2 opensm:
> > >
> > > # smpquery -D nodeinfo 0,1,13,8
> > > # Node info: DR path [0][1][13][8]
> > > BaseVers:........................1
> > > ClassVers:.......................1
> > > NodeType:........................Switch
> > > NumPorts:........................24
> > > SystemGuid:......................0x00066a0803000107
> > > Guid:............................0x00066a00010001e8
> > > PortGuid:........................0x00066a00010001e8
> > > PartCap:.........................8
> > > DevId:...........................0xb924
> > > Revision:........................0x000000a1
> > > LocalPort:.......................24
> > > VendorId:........................0x00066a
> > > # smpquery -D nodeinfo 0,1,13
> > > # Node info: DR path [0][1][13]
> > > BaseVers:........................1
> > > ClassVers:.......................1
> > > NodeType:........................Switch
> > > NumPorts:........................24
> > > SystemGuid:......................0x00066a0867000107
> > > Guid:............................0x00066a0004000133
> > > PortGuid:........................0x00066a0004000133
> > > PartCap:.........................8
> > > DevId:...........................0xb924
> > > Revision:........................0x000000a1
> > > LocalPort:.......................17
> > > VendorId:........................0x00066a
> > > # smpquery -D nodeinfo 0,1
> > > # Node info: DR path [0][1]
> > > BaseVers:........................1
> > > ClassVers:.......................1
> > > NodeType:........................Switch
> > > NumPorts:........................24
> > > SystemGuid:......................0x00066a0808000107
> > > Guid:............................0x00066a000100015e
> > > PortGuid:........................0x00066a000100015e
> > > PartCap:.........................8
> > > DevId:...........................0xb924
> > > Revision:........................0x000000a1
> > > LocalPort:.......................7
> > > VendorId:........................0x00066a
> > >
> > > So the switch SMAs
> > > are fine.
> >
> > Can you do:
> > saquery -s
> > as it seems there could be more SMs in your subnet.
>
> On the node running the 1.3 opensm:
>
> # saquery -s
> IsSM ports
> PortInfoRecord dump:
> EndPortLid..............0x1
> PortNum.................0x1
> base_lid................0x1
> master_sm_base_lid......0x1
> capability_mask.........0x2510A6A
>
> IsSMdisabled ports
>
> # service opensmd stop
> Stopping IB Subnet Manager..-. [ OK ]
>
> # saquery -s
> Query SA failed: IB_TIMEOUT
>
> Then, on the node which runs the 1.2-rc2 opensm:
>
> # saquery -s
> Query SA failed: IB_TIMEOUT
>
> # service opensmd start
> Starting IB Subnet Manager [ OK ]
>
> # saquery -s
> IsSM ports
> PortInfoRecord dump:
> EndPortLid..............0x88
> PortNum.................0x1
> base_lid................0x88
> master_sm_base_lid......0x88
> capability_mask.........0x2510A6A
>
> IsSMdisabled ports
>
> Then, back on the node where I would run the 1.3 opensm:
>
> # saquery -s
> IsSM ports
> PortInfoRecord dump:
> EndPortLid..............0x88
> PortNum.................0x1
> base_lid................0x88
> master_sm_base_lid......0x88
> capability_mask.........0x2510A6A
>
> IsSMdisabled ports

OK; this clearly shows there's only 1 SM at a time here.

Did this exact cluster work fine (with OFED 1.2 (rc2) OpenSM) when the
end nodes were OFED 1.2 rather than 1.3 and that was the only change ?

Did the cluster size change too by any chance ? How large a cluster is
this ? (There were some fixes here which should help for OFED 1.3).

What opensm command line and config file options are being used to
start the OpenSMs ?

Are you trying to stick with the OFED 1.2 (rc2) OpenSM or would the OFED
1.3 OpenSM be OK if it worked in your environment ?

Sorry for all the questions but I'm trying to come up with a theory on
what's not right.

-- Hal

> -- Jim
>
> >
> > > I logged 43 MB of the above errors in 11 minutes.
> > > While opensm was logging those errors, I tried pinging
> > > another node in the fabric via IPoIB, and it was
> > > reachable.
> > >
> > > I stopped the 1.2 opensm and restarted the 1.3 opensm.
> > > It logged nothing after the initial startup messages.
> >
> > Likely it is not master. One way to check this is via SMLID comparison.
> >
> > -- Hal
> >
> > > -- Jim

_______________________________________________
ewg mailing list
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
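
Editorial sketch of the SMLID comparison mentioned in the thread: a small Python helper that reads `saquery -s` text in the format pasted above and compares each record's base_lid with its master_sm_base_lid. The function names (parse_sm_records, is_master) are hypothetical, and the rule that a port hosts the master SM when the two LIDs match is an assumption based on the dumps in this thread, not something taken from the OpenSM sources.

```python
import re

# Matches dump lines such as "base_lid................0x88".
FIELD_RE = re.compile(r"^\s*(\w+)\.{2,}(0x[0-9A-Fa-f]+)\s*$")


def parse_sm_records(text):
    """Return one dict per 'PortInfoRecord dump:' block found in the text."""
    records = []
    current = None
    for line in text.splitlines():
        if line.strip().startswith("PortInfoRecord dump"):
            current = {}
            records.append(current)
            continue
        match = FIELD_RE.match(line)
        if match and current is not None:
            # Field values in the dump are hexadecimal.
            current[match.group(1)] = int(match.group(2), 16)
    return records


def is_master(record):
    """Assumption: the port hosts the master SM when its LID equals the SM LID."""
    lid = record.get("base_lid")
    return lid is not None and lid == record.get("master_sm_base_lid")


# Sample copied from the 1.2-rc2 node's output in this thread.
SAMPLE = """\
IsSM ports
PortInfoRecord dump:
        EndPortLid..............0x88
        PortNum.................0x1
        base_lid................0x88
        master_sm_base_lid......0x88
        capability_mask.........0x2510A6A
"""

if __name__ == "__main__":
    for rec in parse_sm_records(SAMPLE):
        state = "master" if is_master(rec) else "not master"
        print(hex(rec["base_lid"]), state)
```

Feeding it the output captured on each candidate SM node would show which LID currently claims mastership; a "Query SA failed" response simply yields no records.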
