On Mon, 2013-06-17 at 22:00 +0000, Weiny, Ira wrote:
> Does running "update_desc" in the console fix this?

This worked as a short-term solution, but we're still thinking about a
longer-term one that requires less interaction.

Al

> Ira
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-rdma-
> > [email protected]] On Behalf Of Albert Chu
> > Sent: Monday, June 17, 2013 2:38 PM
> > To: [email protected]
> > Subject: Node Description mismatch between saquery & smpquery
> >
> > We've recently noticed that the Node Description for a node can
> > mismatch between the output of smpquery and saquery. For example:
> >
> > # smpquery NodeDesc 427
> > Node Description:.................sierra1932 qib0
> >
> > # saquery NodeRecord 427 | grep NodeDesc
> > NodeDescription.........QLogic Infiniband HCA
> >
> > A restart of OpenSM is the current solution to resolve this.
> >
> > We've noticed it occurring more often on our larger clusters than our
> > smaller clusters, leading to a speculation about why it is happening.
> >
> > The speculation is that when a node comes up, there is a window of time
> > in which the HCA is up and can be scanned by OpenSM, but does not yet
> > have its node descriptor set (in RHEL it appears to be set via
> > /etc/init.d/rdma). During this window, OpenSM reads/stores the
> > non-desired node descriptor (in the above case the non-desired
> > "QLogic Infiniband HCA").
> >
> > When the node descriptor is changed, a trap should be sent to OpenSM
> > indicating the change. Normally OpenSM gets the trap and reads the new
> > node descriptor.
> >
> > On our large clusters all nodes are typically brought up at the same
> > time, so there are probably a ton of node descriptor change traps
> > happening at the exact same time. We speculate a number of these are
> > dropped/lost, and subsequently OpenSM never realizes that the node
> > descriptor has changed.
> >
> > I don't know if the speculation sounds reasonable or not. Regardless,
> > we're not sure of the best fix.
> >
> > A trivial fix would be to just make OpenSM re-scan the node descriptor
> > of an HCA, perhaps during a heavy sweep. But I don't know if this is
> > optimal. It'll introduce more MADs on the wire. However, if the present
> > solution is to restart OpenSM, we figure this can't be any worse.
> >
> > Just wondering what people's thoughts are, or if there's another
> > obvious solution we're not seeing.
> >
> > Al
> >
> > --
> > Albert Chu
> > [email protected]
> > Computer Scientist
> > High Performance Systems Division
> > Lawrence Livermore National Laboratory
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html

-- 
Albert Chu
[email protected]
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
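
For anyone who wants to spot stale entries without restarting OpenSM, the smpquery/saquery comparison from the original message could be scripted. The following is only a minimal sketch: the function names (parse_node_desc, descriptions_match, check_lid) are made up for illustration, the regular expression assumes the two output formats look exactly like the lines quoted in the thread, and the argument 427 is assumed to be a LID.

```python
import re
import subprocess

# Matches both output styles quoted in the thread (assumed to be representative):
#   "Node Description:.................sierra1932 qib0"   (smpquery NodeDesc)
#   "NodeDescription.........QLogic Infiniband HCA"      (saquery NodeRecord)
_DESC_RE = re.compile(r"^\s*Node ?Description:?\.+(.*)$", re.MULTILINE)


def parse_node_desc(output):
    """Return the node description found in tool output, or None if absent."""
    match = _DESC_RE.search(output)
    return match.group(1).strip() if match else None


def descriptions_match(smp_output, sa_output):
    """True only if both outputs contain a description and they are equal."""
    smp_desc = parse_node_desc(smp_output)
    sa_desc = parse_node_desc(sa_output)
    return smp_desc is not None and smp_desc == sa_desc


def check_lid(lid):
    """Run both queries for one LID (hypothetical helper, needs a live fabric).

    Returns (description seen on the node, description stored by the SA).
    """
    smp = subprocess.run(
        ["smpquery", "NodeDesc", str(lid)],
        capture_output=True, text=True, check=True,
    ).stdout
    sa = subprocess.run(
        ["saquery", "NodeRecord", str(lid)],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_node_desc(smp), parse_node_desc(sa)
```

Calling check_lid(427) on the fabric from the example would be expected to return two different strings while the SA record is stale. Looping it over all LIDs sends one extra MAD per node, so it has the same traffic cost that the re-scan idea above raises.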
