Filed https://bugs.opendaylight.org/show_bug.cgi?id=8199 to track this issue.
On Fri, Apr 7, 2017 at 9:12 PM, Srini Seetharaman <srini.seethara...@gmail.com> wrote:

> Ok. I also see that cold starting the instance allows it to catch up. But,
> that would cause it to flush out even the config store, which could've been
> in a good state.
>
> On Fri, Apr 7, 2017 at 6:33 PM, Tom Pantelis <tompante...@gmail.com>
> wrote:
>
>> I think you need to restart all nodes.
>>
>> On Fri, Apr 7, 2017 at 6:47 PM, Srini Seetharaman <
>> srini.seethara...@gmail.com> wrote:
>>
>>> Will do. Thanks for your help.
>>>
>>> Sorry to bother you this much. One last question in this topic - I'm
>>> trying to put in a temporary workaround to recover such instances that are
>>> stuck in a bad state. Is there a way to just remove only the
>>> operational-shard (after which i will restart using the karaf.restart
>>> property)? My hope is that it will sync up after the reboot.
>>>
>>> On Fri, Apr 7, 2017 at 2:04 PM, Tom Pantelis <tompante...@gmail.com>
>>> wrote:
>>>
>>>> Also, when reproducing, please record the exact steps you did and the
>>>> approximate time.
>>>>
>>>> On Fri, Apr 7, 2017 at 5:01 PM, Tom Pantelis <tompante...@gmail.com>
>>>> wrote:
>>>>
>>>>> Also try it with Carbon as well if possible. If it doesn't reproduce
>>>>> there then that would at least verify it's fixed in Carbon then we could
>>>>> figure out which patch(es) to possibly back port to Boron.
>>>>>
>>>>>
>>>>> On Fri, Apr 7, 2017 at 4:27 PM, Srini Seetharaman <
>>>>> srini.seethara...@gmail.com> wrote:
>>>>>
>>>>>> Sure I will enable debug and retry.
>>>>>>
>>>>>> In this case, restarting didn't fix the issue. I had to remove the
>>>>>> journal and snapshot and then restart.
>>>>>>
>>>>>> On Fri, Apr 7, 2017 at 1:16 PM, Tom Pantelis <tompante...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> That was for "Force install snapshot when follower log is ahead"
>>>>>>> which is different than your case. I don't know the sequence that led
>>>>>>> up to your issue. I have seen this before - I don't remember all the
>>>>>>> details but, from what I recall, it's caused by the journal indexing
>>>>>>> getting screwed up at some point earlier during restarts and leader
>>>>>>> changes. As I mentioned, I think it may be fixed in Carbon but I don't
>>>>>>> recall what patch or patches. It can be very tedious debugging such
>>>>>>> issues and usually it's because of something that happened earlier so
>>>>>>> it really helps to have
>>>>>>> org.opendaylight.controller.cluster.datastore.Shard enabled so
>>>>>>> there's a paper trail.
>>>>>>>
>>>>>>> On Fri, Apr 7, 2017 at 3:05 PM, Srini Seetharaman <
>>>>>>> srini.seethara...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Tom
>>>>>>>> I was referring to https://git.opendaylight.org/gerrit/#/c/41323/
>>>>>>>> that was included as early as Beryllium. I am using the latest
>>>>>>>> Boron-SR3 codebase and it is happening.
>>>>>>>>
>>>>>>>> Srini.
>>>>>>>>
>>>>>>>> On Fri, Apr 7, 2017 at 11:53 AM, Tom Pantelis <
>>>>>>>> tompante...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Srini,
>>>>>>>>>
>>>>>>>>> Which patch are you referring to? Forcing a snapshot isn't
>>>>>>>>> something you enable - it does it based on certain conditions. I
>>>>>>>>> have seen that behavior before with the operational store b/c it's
>>>>>>>>> not persistent. I think it's fixed in master so can you try the
>>>>>>>>> latest codebase and see if it reproduces?
>>>>>>>>>
>>>>>>>>> Tom
>>>>>>>>>
>>>>>>>>> Tom
>>>>>>>>>
>>>>>>>>> On Fri, Apr 7, 2017 at 2:35 PM, Srini Seetharaman <
>>>>>>>>> srini.seethara...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Tom
>>>>>>>>>> I see that you pushed a patch to force a snapshot install for a
>>>>>>>>>> follower. I want to enable that in my setup because I frequently
>>>>>>>>>> see cases where the follower prints the following after cluster is
>>>>>>>>>> recovered. It prints it forever and does not seem to recover.
>>>>>>>>>>
>>>>>>>>>> 2017-04-07 11:24:41,471 | INFO | lt-dispatcher-20 | Shard | 187 -
>>>>>>>>>> org.opendaylight.controller.sal-akka-raft - 1.4.3.Boron-SR3 |
>>>>>>>>>> member-2-shard-default-operational (Follower): The log is not empty
>>>>>>>>>> but the prevLogIndex 1685 was not found in it - lastIndex: 1685,
>>>>>>>>>> snapshotIndex: -1
>>>>>>>>>> 2017-04-07 11:24:41,471 | INFO | lt-dispatcher-20 | Shard | 187 -
>>>>>>>>>> org.opendaylight.controller.sal-akka-raft - 1.4.3.Boron-SR3 |
>>>>>>>>>> member-2-shard-default-operational (Follower): Follower is
>>>>>>>>>> out-of-sync so sending negative reply: AppendEntriesReply [term=23,
>>>>>>>>>> success=false, followerId=member-2-shard-default-operational,
>>>>>>>>>> logLastIndex=1685, logLastTerm=9, forceInstallSnapshot=false,
>>>>>>>>>> payloadVersion=5, raftVersion=3]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Srini.
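For anyone reading the follower log lines in this thread: the rough Python sketch below illustrates the kind of Raft AppendEntries consistency check that yields a "prevLogIndex ... was not found" rejection even though lastIndex reports the same number. It is not taken from sal-akka-raft. The names Entry, FollowerLog and check_append_entries are made up for illustration, and it assumes a log that looks entries up by position relative to the snapshot while reporting lastIndex from the last entry's stored index. Under that assumption, a journal whose stored indexes no longer match their positions would be rejected on every AppendEntries.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Entry:
    index: int  # Raft log index stored with the entry
    term: int   # term in which the entry was created


@dataclass
class FollowerLog:
    """Hypothetical in-memory follower log (an assumption, not ODL code)."""
    snapshot_index: int = -1
    entries: List[Entry] = field(default_factory=list)

    def last_index(self) -> int:
        # Reported from the last entry's stored index, not from its position.
        return self.entries[-1].index if self.entries else self.snapshot_index

    def get(self, index: int) -> Optional[Entry]:
        # Positional lookup relative to the snapshot.
        pos = index - (self.snapshot_index + 1)
        if 0 <= pos < len(self.entries):
            return self.entries[pos]
        return None


def check_append_entries(log: FollowerLog, prev_log_index: int,
                         prev_log_term: int) -> Tuple[bool, str]:
    """Return (success, reason) for the follower-side consistency check."""
    if prev_log_index <= log.snapshot_index:
        # Simplification: treat anything covered by the snapshot as matching.
        return True, "covered by snapshot"
    entry = log.get(prev_log_index)
    if entry is None:
        return False, (
            f"prevLogIndex {prev_log_index} was not found - "
            f"lastIndex: {log.last_index()}, "
            f"snapshotIndex: {log.snapshot_index}"
        )
    if entry.term != prev_log_term:
        return False, f"term mismatch at index {prev_log_index}"
    return True, "in sync"


if __name__ == "__main__":
    # Two entries whose stored indexes do not match their positions (0 and 1).
    broken = FollowerLog(entries=[Entry(1684, 9), Entry(1685, 9)])
    print(check_append_entries(broken, 1685, 9))
```

In this sketch the follower never recovers by itself, because each retry from the leader hits the same failed lookup. That would fit the observation in the thread that only removing the journal and snapshot cleared the state, but whether this is the actual mechanism in Boron-SR3 is a guess.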
_______________________________________________
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev