Jim,

On Wed, 2008-06-04 at 13:14 -0600, Jim Schutt wrote:
> On Tue, 2008-06-03 at 11:13 -0600, Hal Rosenstock wrote:
> > Steve,
> >
> > One more thought below...
> >
> > On Tue, 2008-06-03 at 09:49 -0700, Hal Rosenstock wrote:
> > > Steve,
> > >
> > > On Tue, 2008-06-03 at 11:19 -0500, Steve Wise wrote:
> > > > Hello opensm gurus:
> > > >
> > > > Sandia is seeing problems after migrating up to ofed-1.3. They are
> > > > still using an ofed-1.2 opensm but with ofed-1.3 clients, updated from
> > > > ofed-1.2.5.
> > >
> > > Was the OpenSM node changed in some way or only the end nodes ?
>
> Only the end nodes were changed to ofed-1.3.
>
> > >
> > > > They are getting the errors below.
> > > >
> > > > Q: should this work? Or are there backwards compat issues?
> > >
> > > I haven't explicitly tried it but I would think it should work.
> > >
> > > The errors below are timeouts on switch MFT sets which are only
> > > indirectly related to the end nodes (in that the MC SA joins cause the
> > > MC routing and those tables to be set) so I don't see the relationship
> > > but might be missing something.
> > >
> > > -- Hal
> > >
> > > > Thanks,
> > > >
> > > > Steve.
> > > >
> >
> > Could this switch SMA be "stuck" ?
> >
> > Could you try smpquery -D nodeinfo 0,1,14,9
> > and
> > smpquery -D nodeinfo 0,1,14
> > from the SM node ?
> >
> > -- Hal
> >
> So I've stopped the 1.3 opensm (which was not logging anything)

Perhaps this OpenSM was not master.

> and started up the 1.2 opensm on another node. After a short time
> (<1 min) it started logging these again:

That's likely when it becomes master again.

> ------
>
> Jun 04 11:51:13 063963 [45007960] -> umad_receiver: ERR 5409: send completed
> with error (method=0x2 attr=0x1B trans_id=0x23000406da) -- dropping
> Jun 04 11:51:13 063973 [45007960] -> umad_receiver: ERR 5411: DR SMP Hop Ptr:
> 0x0
> Jun 04 11:51:13 063986 [45007960] -> Received SMP on a 3 hop path:
> Initial path = 0,0,0,0
> Return path = 0,0,0,0
> Jun 04 11:51:13 063996 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113:
> MAD completed in error (IB_TIMEOUT)
> Jun 04 11:51:13 064004 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3119:
> Set method failed
> Jun 04 11:51:13 064034 [45007960] -> SMP dump:
> base_ver................0x1
> mgmt_class..............0x81
> class_ver...............0x1
> method..................0x2 (SubnSet)
> D bit...................0x0
> status..................0x0
> hop_ptr.................0x0
> hop_count...............0x3
> trans_id................0x406da
> attr_id.................0x1B
> (MulticastForwardingTable)
> resv....................0x0
> attr_mod................0x10000000
> m_key...................0x0000000000000000
> dr_slid.................0xFFFF
> dr_dlid.................0xFFFF
>
> Initial path: 0,1,13,8
> Return path: 0,0,0,0
> Reserved: [0][0][0][0][0][0][0]
>
> 00 40 00 40 00 00 00 00 00 00 00 00 00 00
> 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00 00 40
> 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00
>
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00 00
>
> Jun 04 11:51:13 067533 [42803960] -> Errors during initialization
> Jun 04 11:51:13 067607 [42803960] -> __osm_state_mgr_init_errors_msg:
>
> ------
>
> I tried your smpquery suggestion from the node running
> the 1.2 opensm:
>
> # smpquery -D nodeinfo 0,1,13,8
> # Node info: DR path [0][1][13][8]
> BaseVers:........................1
> ClassVers:.......................1
> NodeType:........................Switch
> NumPorts:........................24
> SystemGuid:......................0x00066a0803000107
> Guid:............................0x00066a00010001e8
> PortGuid:........................0x00066a00010001e8
> PartCap:.........................8
> DevId:...........................0xb924
> Revision:........................0x000000a1
> LocalPort:.......................24
> VendorId:........................0x00066a
> # smpquery -D nodeinfo 0,1,13
> # Node info: DR path [0][1][13]
> BaseVers:........................1
> ClassVers:.......................1
> NodeType:........................Switch
> NumPorts:........................24
> SystemGuid:......................0x00066a0867000107
> Guid:............................0x00066a0004000133
> PortGuid:........................0x00066a0004000133
> PartCap:.........................8
> DevId:...........................0xb924
> Revision:........................0x000000a1
> LocalPort:.......................17
> VendorId:........................0x00066a
> # smpquery -D nodeinfo 0,1
> # Node info: DR path [0][1]
> BaseVers:........................1
> ClassVers:.......................1
> NodeType:........................Switch
> NumPorts:........................24
> SystemGuid:......................0x00066a0808000107
> Guid:............................0x00066a000100015e
> PortGuid:........................0x00066a000100015e
> PartCap:.........................8
> DevId:...........................0xb924
> Revision:........................0x000000a1
> LocalPort:.......................7
> VendorId:........................0x00066a

So the switch SMAs are fine. Can you do:
saquery -s
as it seems there could be more SMs in your subnet.

> I logged 43 MB of the above errors in 11 minutes.
> While opensm was logging those errors, I tried pinging
> another node in the fabric via IPoIB, and it was
> reachable.
>
> I stopped the 1.2 opensm and restarted the 1.3 opensm.
> It logged nothing after the initial startup messages.

Likely it is not master. One way to check this is via SMLID comparison.

-- Hal

> -- Jim

_______________________________________________
ewg mailing list
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
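
The SMLID comparison mentioned above could be scripted roughly as follows. This is only a sketch, not something from the thread: it assumes the port information is available as ibstat-style text containing "Base lid:" and "SM lid:" fields, and the helper names (parse_lids, port_hosts_master_sm) and the sample text are made up for illustration. If a port's own base LID equals the SM LID it reports, the master SM is running on that port; if they differ, an OpenSM started on that node is most likely in standby.

```python
import re


def parse_lids(port_text):
    """Return (base_lid, sm_lid) from ibstat-style text.

    Assumption: the text has 'Base lid: N' and 'SM lid: N' lines.
    Only the first occurrence of each field is used.
    """
    base = re.search(r"Base lid:\s*(\d+)", port_text)
    sm = re.search(r"SM lid:\s*(\d+)", port_text)
    if base is None or sm is None:
        raise ValueError("could not find 'Base lid' and 'SM lid' fields")
    return int(base.group(1)), int(sm.group(1))


def port_hosts_master_sm(port_text):
    """True if the port's base LID matches the SM LID it reports.

    A base LID of 0 means the port has not been assigned a LID yet,
    so it is never treated as the master SM port.
    """
    base_lid, sm_lid = parse_lids(port_text)
    return base_lid != 0 and base_lid == sm_lid


# Hypothetical sample text, NOT taken from the fabric in this thread.
SAMPLE_STANDBY_NODE = """
Port 1:
    State: Active
    Base lid: 12
    LMC: 0
    SM lid: 3
"""

SAMPLE_MASTER_NODE = """
Port 1:
    State: Active
    Base lid: 3
    LMC: 0
    SM lid: 3
"""

if __name__ == "__main__":
    for name, text in (("standby?", SAMPLE_STANDBY_NODE),
                       ("master?", SAMPLE_MASTER_NODE)):
        print(name, parse_lids(text), port_hosts_master_sm(text))
```

Running the same check on each node where an opensm is started would show which one the fabric currently treats as master, which is the question the silent 1.3 opensm raises.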
