Re: OpenSM Failover

Yevgeny Kliteynik Mon, 12 Oct 2009 00:34:53 -0700

Aaron,

Aaron Knister wrote:

I just stumbled across this in the release notes for opensm 3.2.6:
"* SMs do not hand-over when running on ConnectX in a switch-based topology."
So I guess that answers the question of whether or not what I'm seeing is "expected behavior". Out of curiosity, what are the technical reasons for this? I just tried opensm 3.3.2 and I still experience the same behavior.


There was a hand-over problem in OFED 1.4, but later it turned
out to be a FW issue. The thing is, FW version 2.6.648 doesn't
have this bug any more...

The 30 seconds for the initial failover is expected, but the 40 second failback when the original master comes back is a problem.


Can you please double check that FW version 2.6.648 is used on
both HCAs that run OSM?
And what is the FW version of the HCAs that don't have this problem?

Also, can you please reproduce the issue running OSM as follows:
        opensm -V -e -s 0
on both nodes and attach the /var/log/opensm.log files?

-- Yevgeny

On Oct 10, 2009, at 7:38 PM, Aaron Knister wrote:
I'm not sure if this is the right place to post about this issue, but here goes:
I'm having problems with OpenSM failover.
I have two nodes running opensmd version "3.2.6_20090317" from RHEL5.4. I'm using a configuration file on both, generated using opensm -c. When I start the subnet manager on node a, everything is fine. It appears to reassign itself a lid of 1, which I think is expected. When I start the subnet manager on node b, everything is fine. If you query its lid, it shows the subnet manager in a standby state.
Now for the fun. If I stop opensmd on node a (service opensmd stop), then all of the traffic on the fabric stops. It takes node b's OpenSM instance about 30 seconds to realize that node a's subnet manager is dead and come up in the master state. Now when node a's subnet manager comes back (service opensmd start), all traffic on the fabric stops and node b's subnet manager goes into the standby state... but node a's subnet manager doesn't take over the fabric and come up as master for about another 40 seconds (during this time no traffic passes over the fabric). The logs below should help illustrate what I'm seeing:
Oct 10 19:14:14 node-a OpenSM[14132]: Entering DISCOVERING state
Oct 10 19:14:14 node-a OpenSM[14132]: Entering MASTER state
Oct 10 19:14:14 node-a OpenSM[14132]: SUBNET UP

Oct 10 19:14:25 node-b OpenSM[11197]: /var/log/opensm.log log file opened
Oct 10 19:14:25 node-b OpenSM[11197]: OpenSM 3.2.6_20090317
Oct 10 19:14:25 node-b OpenSM[11197]: Entering DISCOVERING state
Oct 10 19:14:26 node-b OpenSM[11197]: Entering STANDBY state

Oct 10 19:15:44 node-a OpenSM[14132]: Exiting SM
Oct 10 19:16:16 node-b OpenSM[11197]: Entering DISCOVERING state
Oct 10 19:16:16 node-b OpenSM[11197]: Entering MASTER state

Oct 10 19:18:52 node-a OpenSM[14213]: /var/log/opensm.log log file opened
Oct 10 19:18:52 node-a OpenSM[14213]: OpenSM 3.2.6_20090317
Oct 10 19:18:52 node-a OpenSM[14213]: Entering DISCOVERING state
Oct 10 19:18:53 node-b OpenSM[11197]: Entering STANDBY state
Oct 10 19:18:53 node-a OpenSM[14213]: Entering STANDBY state
Oct 10 19:19:33 node-a OpenSM[14213]: Entering DISCOVERING state
Oct 10 19:19:33 node-a OpenSM[14213]: Entering MASTER state
Oct 10 19:19:33 node-a OpenSM[14213]: SUBNET UP
We have opensm 3.2.5_20081207 (ofed 1.4) on another cluster and it fails over and fails back almost instantly with seemingly no traffic interruption if you gracefully stopped the active opensmd instance (service opensmd stop). Is the behavior I'm seeing considered normal? I can understand the 30 seconds for the initial failover but why the 40 second failback when the original master comes back? Any help is appreciated :)
BTW my switch is a Qlogic 12800-180 with the latest firmware and the HCAs are Mellanox MT26428 running firmware version 2.6.648.
Thanks!

-Aaron
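
The two delays described in the report can be read straight off the timestamps in the log excerpt. The following is only a rough sketch for doing that arithmetic, not an OpenSM tool: the function names parse and masterless_gaps are made up for illustration, the year 2009 is assumed because syslog lines omit it, and treating "Exiting SM" or "Entering STANDBY state" on the current master as the start of a gap is an assumption about how to interpret the log.

```python
from datetime import datetime

# State-transition lines copied from the log excerpt in the report.
LOG = """\
Oct 10 19:14:14 node-a OpenSM[14132]: Entering MASTER state
Oct 10 19:14:26 node-b OpenSM[11197]: Entering STANDBY state
Oct 10 19:15:44 node-a OpenSM[14132]: Exiting SM
Oct 10 19:16:16 node-b OpenSM[11197]: Entering MASTER state
Oct 10 19:18:53 node-b OpenSM[11197]: Entering STANDBY state
Oct 10 19:18:53 node-a OpenSM[14213]: Entering STANDBY state
Oct 10 19:19:33 node-a OpenSM[14213]: Entering MASTER state
"""


def parse(line, year=2009):
    """Split one syslog-style line into (timestamp, host, message).

    The year is an assumption; syslog timestamps do not carry one.
    """
    month, day, clock, host, _tag, message = line.strip().split(None, 5)
    stamp = datetime.strptime(f"{year} {month} {day} {clock}",
                              "%Y %b %d %H:%M:%S")
    return stamp, host, message


def masterless_gaps(text):
    """Return (new_master_host, seconds) for each span with no logged master."""
    master = None    # host currently logged as MASTER
    lost_at = None   # time at which the last master stepped down
    gaps = []
    for line in text.splitlines():
        if not line.strip():
            continue
        stamp, host, message = parse(line)
        if message == "Entering MASTER state":
            if lost_at is not None:
                gaps.append((host, int((stamp - lost_at).total_seconds())))
                lost_at = None
            master = host
        elif message in ("Exiting SM", "Entering STANDBY state") and host == master:
            # The current master left the MASTER state: a gap starts here.
            master, lost_at = None, stamp
    return gaps


if __name__ == "__main__":
    for host, seconds in masterless_gaps(LOG):
        print(f"{host} became master after {seconds} s without a master")
```

On this excerpt the sketch reports 32 seconds before node-b takes over and 40 seconds before node-a takes over again, which matches the "about 30 seconds" and "about another 40 seconds" in the description.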
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

