Re: OpenSM Failover

Aaron Knister Sat, 10 Oct 2009 17:04:44 -0700

I just stumbled across this in the release notes for opensm 3.2.6-

"* SMs do not hand-over when running on ConnectX in a switch-based topology."

So I guess that answers the question of whether or not what I'm seeing is "expected behavior". Out of curiosity, what are the technical reasons for this? I just tried opensm 3.3.2 and I still experience the same behavior.


On Oct 10, 2009, at 7:38 PM, Aaron Knister wrote:

I'm not sure if this is the right place to post about this issue, but here goes-
I'm having problems with OpenSM failover.
I have two nodes running opensmd version "3.2.6_20090317" from RHEL 5.4. I'm using a configuration file on both, generated using opensm -c. When I start the subnet manager on node a, everything is fine. It appears to reassign itself a LID of 1, which I think is expected. When I start the subnet manager on node b, everything is fine. If you query its LID, it shows the subnet manager in a standby state.
Now for the fun. If I stop opensmd on node a (service opensmd stop), then all of the traffic on the fabric stops. It takes node b's OpenSM instance about 30 seconds to realize that node a's subnet manager is dead and come up in the master state. Now when node a's subnet manager comes back (service opensmd start), all traffic on the fabric stops and node b's subnet manager goes into the standby state... but node a's subnet manager doesn't take over the fabric and come up as master for about another 40 seconds (during this time no traffic passes over the fabric). The logs below should help illustrate what I'm seeing:
Oct 10 19:14:14 node-a OpenSM[14132]: Entering DISCOVERING state
Oct 10 19:14:14 node-a OpenSM[14132]: Entering MASTER state
Oct 10 19:14:14 node-a OpenSM[14132]: SUBNET UP
Oct 10 19:14:25 node-b OpenSM[11197]: /var/log/opensm.log log file opened
Oct 10 19:14:25 node-b OpenSM[11197]: OpenSM 3.2.6_20090317
Oct 10 19:14:25 node-b OpenSM[11197]: Entering DISCOVERING state
Oct 10 19:14:26 node-b OpenSM[11197]: Entering STANDBY state

Oct 10 19:15:44 node-a OpenSM[14132]: Exiting SM
Oct 10 19:16:16 node-b OpenSM[11197]: Entering DISCOVERING state
Oct 10 19:16:16 node-b OpenSM[11197]: Entering MASTER state
Oct 10 19:18:52 node-a OpenSM[14213]: /var/log/opensm.log log file opened
Oct 10 19:18:52 node-a OpenSM[14213]: OpenSM 3.2.6_20090317
Oct 10 19:18:52 node-a OpenSM[14213]: Entering DISCOVERING state
Oct 10 19:18:53 node-b OpenSM[11197]: Entering STANDBY state
Oct 10 19:18:53 node-a OpenSM[14213]: Entering STANDBY state
Oct 10 19:19:33 node-a OpenSM[14213]: Entering DISCOVERING state
Oct 10 19:19:33 node-a OpenSM[14213]: Entering MASTER state
Oct 10 19:19:33 node-a OpenSM[14213]: SUBNET UP
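
For anyone wanting to measure these gaps rather than eyeball them, here is a minimal sketch in Python. It is not part of OpenSM; the function names (parse, masterless_gaps) are made up for illustration, and it assumes the syslog line layout shown above, a year of 2009 (syslog lines carry no year), and an English locale for the month name. It only reports how long the log shows no SM in the MASTER state, which is a proxy for the traffic outage and not a direct measurement of it.

```python
from datetime import datetime

# Subset of the log lines quoted above (state transitions only).
LOG = """\
Oct 10 19:14:14 node-a OpenSM[14132]: Entering MASTER state
Oct 10 19:14:26 node-b OpenSM[11197]: Entering STANDBY state
Oct 10 19:15:44 node-a OpenSM[14132]: Exiting SM
Oct 10 19:16:16 node-b OpenSM[11197]: Entering MASTER state
Oct 10 19:18:53 node-b OpenSM[11197]: Entering STANDBY state
Oct 10 19:18:53 node-a OpenSM[14213]: Entering STANDBY state
Oct 10 19:19:33 node-a OpenSM[14213]: Entering MASTER state
"""


def parse(line, year=2009):
    """Split one syslog line into (timestamp, host, message).

    The year is an assumption: syslog timestamps do not include one.
    """
    month, day, clock, host, _proc, message = line.split(None, 5)
    when = datetime.strptime(f"{year} {month} {day} {clock}",
                             "%Y %b %d %H:%M:%S")
    return when, host, message.strip()


def masterless_gaps(lines):
    """Return (new_master, seconds) for each period with no logged master."""
    master = None    # host currently logged as MASTER
    lost_at = None   # time the previous master exited or stepped down
    gaps = []
    for line in lines:
        when, host, message = parse(line)
        if message == "Entering MASTER state":
            if lost_at is not None:
                gaps.append((host, int((when - lost_at).total_seconds())))
                lost_at = None
            master = host
        elif host == master and message in ("Exiting SM",
                                            "Entering STANDBY state"):
            master, lost_at = None, when
    return gaps


if __name__ == "__main__":
    for host, seconds in masterless_gaps(LOG.splitlines()):
        print(f"{host} became master after {seconds} s without a master")
```

On the excerpt above this gives 32 seconds for the failover to node b and 40 seconds for the failback to node a, which matches the "about 30" and "about 40" seconds described.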
We have opensm 3.2.5_20081207 (ofed 1.4) on another cluster and it fails over and fails back almost instantly with seemingly no traffic interruption if you gracefully stop the active opensmd instance (service opensmd stop). Is the behavior I'm seeing considered normal? I can understand the 30 seconds for the initial failover, but why the 40 second failback when the original master comes back? Any help is appreciated :)
BTW my switch is a Qlogic 12800-180 with the latest firmware and the HCAs are Mellanox MT26428 running firmware version 2.6.648.
Thanks!

-Aaron

