Ok. I also see that cold-starting the instance allows it to catch up. But that would also flush the config store, which could have been in a good state.
On Fri, Apr 7, 2017 at 6:33 PM, Tom Pantelis <tompante...@gmail.com> wrote:

> I think you need to restart all nodes.
>
> On Fri, Apr 7, 2017 at 6:47 PM, Srini Seetharaman <srini.seethara...@gmail.com> wrote:
>
>> Will do. Thanks for your help.
>>
>> Sorry to bother you this much. One last question in this topic - I'm
>> trying to put in a temporary workaround to recover such instances that
>> are stuck in a bad state. Is there a way to just remove only the
>> operational-shard (after which i will restart using the karaf.restart
>> property)? My hope is that it will sync up after the reboot.
>>
>> On Fri, Apr 7, 2017 at 2:04 PM, Tom Pantelis <tompante...@gmail.com> wrote:
>>
>>> Also, when reproducing, please record the exact steps you did and the
>>> approximate time.
>>>
>>> On Fri, Apr 7, 2017 at 5:01 PM, Tom Pantelis <tompante...@gmail.com> wrote:
>>>
>>>> Also try it with Carbon as well if possible. If it doesn't reproduce
>>>> there then that would at least verify it's fixed in Carbon then we
>>>> could figure out which patch(es) to possibly back port to Boron.
>>>>
>>>> On Fri, Apr 7, 2017 at 4:27 PM, Srini Seetharaman <srini.seethara...@gmail.com> wrote:
>>>>
>>>>> Sure I will enable debug and retry.
>>>>>
>>>>> In this case, restarting didn't fix the issue. I had to remove the
>>>>> journal and snapshot and then restart.
>>>>>
>>>>> On Fri, Apr 7, 2017 at 1:16 PM, Tom Pantelis <tompante...@gmail.com> wrote:
>>>>>
>>>>>> That was for "Force install snapshot when follower log is ahead"
>>>>>> which is different than your case. I don't know the sequence that
>>>>>> led up to your issue. I have seen this before - I don't remember
>>>>>> all the details but, from what I recall, it's caused by the journal
>>>>>> indexing getting screwed up at some point earlier during restarts
>>>>>> and leader changes. As I mentioned, I think it may be fixed in
>>>>>> Carbon but I don't recall what patch or patches. It can be very
>>>>>> tedious debugging such issues and usually it's because of something
>>>>>> that happened earlier so it really helps to have
>>>>>> org.opendaylight.controller.cluster.datastore.Shard enabled so
>>>>>> there's a paper trail.
>>>>>>
>>>>>> On Fri, Apr 7, 2017 at 3:05 PM, Srini Seetharaman <srini.seethara...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Tom
>>>>>>> I was referring to https://git.opendaylight.org/gerrit/#/c/41323/
>>>>>>> that was included as early as Beryllium. I am using the latest
>>>>>>> Boron-SR3 codebase and it is happening.
>>>>>>>
>>>>>>> Srini.
>>>>>>>
>>>>>>> On Fri, Apr 7, 2017 at 11:53 AM, Tom Pantelis <tompante...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Srini,
>>>>>>>>
>>>>>>>> Which patch are you referring to? Forcing a snapshot isn't
>>>>>>>> something you enable - it does it based on certain conditions. I
>>>>>>>> have seen that behavior before with the operational store b/c
>>>>>>>> it's not persistent. I think it's fixed in master so can you try
>>>>>>>> the latest codebase and see if it reproduces?
>>>>>>>>
>>>>>>>> Tom
>>>>>>>>
>>>>>>>> Tom
>>>>>>>>
>>>>>>>> On Fri, Apr 7, 2017 at 2:35 PM, Srini Seetharaman <srini.seethara...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Tom
>>>>>>>>> I see that you pushed a patch to force a snapshot install for a
>>>>>>>>> follower. I want to enable that in my setup because I frequently
>>>>>>>>> see cases where the follower prints the following after cluster
>>>>>>>>> is recovered. It prints it forever and does not seem to recover.
>>>>>>>>>
>>>>>>>>> 2017-04-07 11:24:41,471 | INFO | lt-dispatcher-20 | Shard | 187 - org.opendaylight.controller.sal-akka-raft - 1.4.3.Boron-SR3 | member-2-shard-default-operational (Follower): The log is not empty but the prevLogIndex 1685 was not found in it - lastIndex: 1685, snapshotIndex: -1
>>>>>>>>> 2017-04-07 11:24:41,471 | INFO | lt-dispatcher-20 | Shard | 187 - org.opendaylight.controller.sal-akka-raft - 1.4.3.Boron-SR3 | member-2-shard-default-operational (Follower): Follower is out-of-sync so sending negative reply: AppendEntriesReply [term=23, success=false, followerId=member-2-shard-default-operational, logLastIndex=1685, logLastTerm=9, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Srini.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
_______________________________________________
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev