On Mon, 2013-06-17 at 22:00 +0000, Weiny, Ira wrote:
> Does running "update_desc" in the console fix this?

This worked as a short-term solution.  But we're still thinking about a
longer-term one that requires less interaction.
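
As a starting point for something less interactive, here is a minimal
Python sketch that detects the mismatch from captured command output.  It
assumes the output looks like the smpquery/saquery examples in the original
report quoted below; the helper names (node_desc, descriptions_match) are
hypothetical and not part of OpenSM or infiniband-diags.

```python
import re

# Assumption: both tools print "<label><dots><value>" as in the report:
#   smpquery: "Node Description:.................sierra1932 qib0"
#   saquery:  "    NodeDescription.........QLogic Infiniband HCA"
_DESC_RE = re.compile(
    r"^[ \t]*Node ?Description:?\.+(.*?)[ \t]*$", re.MULTILINE
)


def node_desc(output):
    """Return the node description found in tool output, or None."""
    m = _DESC_RE.search(output)
    return m.group(1) if m else None


def descriptions_match(smp_out, sa_out):
    """True only if both outputs contain the same node description."""
    smp = node_desc(smp_out)
    sa = node_desc(sa_out)
    return smp is not None and smp == sa
```

The two strings would come from running "smpquery NodeDesc <lid>" and
"saquery NodeRecord <lid>" for each LID of interest; a False result marks a
node whose description OpenSM may need to re-read.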

Al

> Ira
> 
> > -----Original Message-----
> > From: [email protected] [mailto:linux-rdma-
> > [email protected]] On Behalf Of Albert Chu
> > Sent: Monday, June 17, 2013 2:38 PM
> > To: [email protected]
> > Subject: Node Description mismatch between saquery & smpquery
> > 
> > We've recently noticed that the Node Description for a node can
> > mismatch between the output of smpquery and saquery.  For example:
> > 
> > # smpquery NodeDesc 427
> > Node Description:.................sierra1932 qib0
> > 
> > # saquery NodeRecord 427 | grep NodeDesc
> >                 NodeDescription.........QLogic Infiniband HCA
> > 
> > A restart of OpenSM is the current solution to resolve this.
> > 
> > We've noticed it occurring more often on our larger clusters than on our
> > smaller clusters, which has led to a speculation about why it is happening.
> > 
> > The speculation is that when a node comes up, there is a window of time in
> > which the HCA is up and can be scanned by OpenSM, but does not yet have its
> > node descriptor set (in RHEL it appears to be set via /etc/init.d/rdma).
> > During this window, OpenSM reads and stores the undesired node descriptor
> > (in the above case, the undesired "QLogic Infiniband HCA").
> > 
> > When the node descriptor is changed, a trap should be sent to OpenSM
> > indicating the change.  Normally OpenSM gets the trap and reads the new
> > node descriptor.
> > 
> > On our large clusters all nodes are typically brought up at the same time,
> > so there are probably a ton of node descriptor change traps happening at
> > the exact same time.  We speculate that a number of these are dropped/lost,
> > and subsequently OpenSM never realizes that the node descriptor has changed.
> > 
> > I don't know if the speculation sounds reasonable or not.  Regardless, we're
> > not sure of the best fix.
> > 
> > A trivial fix would be to just make OpenSM re-scan the node descriptor of an
> > HCA, perhaps during a heavy sweep.  But I don't know if this is optimal.
> > It'll introduce more MADs on the wire.  However, if the present solution is
> > to restart OpenSM, we figure this can't be any worse.
> > 
> > Just wondering what people's thoughts are, or if there's another obvious
> > solution we're not seeing.
> > 
> > Al
> > 
> > --
> > Albert Chu
> > [email protected]
> > Computer Scientist
> > High Performance Systems Division
> > Lawrence Livermore National Laboratory
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to [email protected]
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Albert Chu
[email protected]
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
