I wouldn't expect an SM that goes belly up to fail over instantly, but I would have thought that if you shut it down gracefully (with a regular kill signal), it could notify a standby SM that it's going away. Either way, I appreciate the help!
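
As a rough sanity check on the 30 second delay, the polling defaults quoted below (4 polls, 10 seconds in between) can be turned into a back-of-the-envelope window. This is only a simplified model and not OpenSM's actual logic: it assumes the standby declares the master dead on the Nth consecutive failed poll, that the first failed poll lands somewhere between 0 and one interval after the master dies, and that per-poll timeouts are negligible. The function name `failover_window` is made up for illustration.

```python
from typing import Tuple


def failover_window(polls: int, interval_s: float) -> Tuple[float, float]:
    """Return (earliest, latest) seconds between master death and detection.

    Simplified model (an assumption, not taken from the OpenSM source):
    the standby polls every `interval_s` seconds and gives up on the
    master after `polls` consecutive failed polls.
    """
    if polls < 1 or interval_s <= 0:
        raise ValueError("polls must be >= 1 and interval_s must be > 0")
    # Best case: the master dies just before a poll, so the first failure
    # is immediate and only (polls - 1) further intervals are needed.
    earliest = (polls - 1) * interval_s
    # Worst case: the master dies just after a successful poll, so a full
    # interval passes before the first failure is even seen.
    latest = polls * interval_s
    return earliest, latest


if __name__ == "__main__":
    lo, hi = failover_window(polls=4, interval_s=10)
    print(f"detection window: {lo:.0f} to {hi:.0f} seconds")
```

With the quoted defaults this gives a window of 30 to 40 seconds, which would be consistent with the delay I'm seeing if the model above is anywhere near right.
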

On Tue, Oct 13, 2009 at 12:13 PM, Yevgeny Kliteynik <[email protected]> wrote:
> Aaron Knister wrote:
>>
>> Thanks! I really appreciate that.
>>
>> I still have a question about the initial failover - I'm still
>> wondering why there's a 30 second delay. Wouldn't nodeA send some type
>> of handover message (my IB knowledge is limited) to notify a subnet
>> manager of a lower priority to take over?
>
> When the master SM dies, it can't notify anyone that it is dead.
> The standby SM keeps polling the master SM, and if the latter is
> dead, it will be getting timeouts on these polls.
> The number of polls and the time that passes between these
> polls are configurable.
> Default is 4 polls with 10 seconds waiting in between.
>
> Note that SM fail-over might be very destructive to the traffic.
> Depending on the SM configuration, it can change LIDs and routing,
> and it will require flushing all the path resolutions on all the
> fabric nodes and refreshing all the multicast membership in the
> subnet.
>
>> As I said, the older opensms
>> on the older mellanox model HCAs fail over and fail back instantly.
>
> The instant failback is expected, and this is the bug that
> we're discussing. As for the instant failover - I'll check
> how things are supposed to work and get back to you.
>
> -- Yevgeny
>
>> On Tue, Oct 13, 2009 at 11:32 AM, Yevgeny Kliteynik
>> <[email protected]> wrote:
>>>
>>> Aaron,
>>>
>>> Thanks for the logs, this was really helpful.
>>> Looks like there is a handover race in the OSM -
>>> SM on node A misses the fact that SM on node B
>>> has given up its mastership.
>>>
>>> There is a bugzilla issue that describes all the
>>> details of this race:
>>>
>>> https://bugs.openfabrics.org/show_bug.cgi?id=1499
>>>
>>> I've updated the issue form with your case, and we will continue
>>> following this bug there.
>>>
>>> -- Yevgeny
>>>
>>> Aaron Knister wrote:
>>>>
>>>> While the adapters have mellanox chipsets, they're actually IBM OEM
>>>> branded and IBM hasn't released the 2.7 fw yet. I'm a little hesitant
>>>> to apply the generic Mellanox FW.
>>>>
>>>> On Mon, Oct 12, 2009 at 4:22 AM, Yevgeny Kliteynik
>>>> <[email protected]> wrote:
>>>>>
>>>>> Or Gerlitz wrote:
>>>>>>
>>>>>> Yevgeny Kliteynik wrote:
>>>>>>>
>>>>>>> There was a hand-over problem in OFED 1.4, but later it turned
>>>>>>> out to be a FW issue. The thing is, FW version 2.6.648 doesn't
>>>>>>> have this bug any more...
>>>>>>
>>>>>> so things should work fine with the newly released 2.7 firmware?
>>>>>
>>>>> Yes
>>>>>
>>>>>> if this is still under question, Aaron, I suggest you open a
>>>>>> bugzilla case @ https://bugs.openfabrics.org and we can track
>>>>>> from there.
>>>>>
>>>>> Good idea.
>>>>>
>>>>> -- Yevgeny
>>>>>
>>>>>> Or.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
