It will be difficult to get you those logs until the older cluster is decommissioned (which in theory should be soon), but as soon as I am able, I will get them to you.
On Wed, Oct 14, 2009 at 4:46 AM, Yevgeny Kliteynik <[email protected]> wrote:
> Aaron,
>
> Aaron Knister wrote:
>>>>
>>>> As I said, the older opensms
>>>> on the older mellanox model HCAs failsover and failsback instantly.
>>>
>>> The instant failback is expected, and this is the bug that
>>> we're discussing. As for the instant failover - I'll check
>>> how the things supposed to work and get back to you.
>
> After checking this thing, I don't understand how the
> instant failover is possible. The only case it can
> work is if you don't have a switch in your subnet - just two HCAs
> connected directly to each other.
> Is this the case?
>
> If not, then I'd like to see opensm logs.
> Please run opensm as before (-V -s 0 -e).
> Start OSM on node A with high priority.
> Start OSM on node B with low priority.
> Kill OSM on node A, and see that OSM on
> node B becomes master.
>
> I need only the log of the opensm on node B.
> Best if you could just attach it to the bugzilla
> issue form, but if you can't - you can mail it to me.
>
> -- Yevgeny
>
>>> -- Yevgeny
>>>
>>>> On Tue, Oct 13, 2009 at 11:32 AM, Yevgeny Kliteynik
>>>> <[email protected]> wrote:
>>>>>
>>>>> Aaron,
>>>>>
>>>>> Thanks for the logs, this was really helpful.
>>>>> Looks like there is a handover race in the OSM -
>>>>> SM on node A misses the fact that SM on node B
>>>>> have gave up its mastership.
>>>>>
>>>>> There is a bugzilla issue the describes all the
>>>>> details of this race:
>>>>>
>>>>> https://bugs.openfabrics.org/show_bug.cgi?id=1499
>>>>>
>>>>> I've updated the issue form with your case, and we will continue
>>>>> following this bug there.
>>>>>
>>>>> -- Yevgeny
>>>>>
>>>>> Aaron Knister wrote:
>>>>>>
>>>>>> While the adapters have mellanox chipsets their actually IBM OEM
>>>>>> branded and IBM hasn't released the 2.7 fw yet. I'm a little hesitant
>>>>>> to apply the generic Mellanox FW.
>>>>>>
>>>>>> On Mon, Oct 12, 2009 at 4:22 AM, Yevgeny Kliteynik
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Or Gerlitz wrote:
>>>>>>>>
>>>>>>>> Yevgeny Kliteynik wrote:
>>>>>>>>>
>>>>>>>>> There was a hand-over problem in OFED 1.4, but later it turned out
>>>>>>>>> to be FW issue. The thing is, FW version 2.6.648 doesn't have this
>>>>>>>>> bug any more...
>>>>>>>>
>>>>>>>> so things should work fine with the newly released 2.7 firmware?
>>>>>>>
>>>>>>> Yes
>>>>>>>
>>>>>>>> if this is still under question, Aaron, I suggest you open a
>>>>>>>> bugzilla case @ https://bugs.openfabrics.org and we can track
>>>>>>>> from there.
>>>>>>>
>>>>>>> Good idea.
>>>>>>>
>>>>>>> -- Yevgeny
>>>>>>>
>>>>>>>> Or.
>>>>>>>>
>>>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
