OK. I also see that cold-starting the instance allows it to catch up. But
that would also flush the config store, which could have been in a good
state.

On Fri, Apr 7, 2017 at 6:33 PM, Tom Pantelis <tompante...@gmail.com> wrote:

> I think you need to restart all nodes.
>
> On Fri, Apr 7, 2017 at 6:47 PM, Srini Seetharaman <
> srini.seethara...@gmail.com> wrote:
>
>> Will do. Thanks for your help.
>>
>> Sorry to bother you this much. One last question on this topic: I'm
>> trying to put in a temporary workaround to recover instances that are
>> stuck in a bad state. Is there a way to remove only the
>> operational-shard (after which I will restart using the karaf.restart
>> property)? My hope is that it will sync up after the reboot.
>>
>> On Fri, Apr 7, 2017 at 2:04 PM, Tom Pantelis <tompante...@gmail.com>
>> wrote:
>>
>>> Also, when reproducing, please record the exact steps you did and the
>>> approximate time.
>>>
>>> On Fri, Apr 7, 2017 at 5:01 PM, Tom Pantelis <tompante...@gmail.com>
>>> wrote:
>>>
>>>> Also try it with Carbon if possible. If it doesn't reproduce there,
>>>> that would at least verify it's fixed in Carbon, and then we could
>>>> figure out which patch(es) to possibly backport to Boron.
>>>>
>>>>
>>>> On Fri, Apr 7, 2017 at 4:27 PM, Srini Seetharaman <
>>>> srini.seethara...@gmail.com> wrote:
>>>>
>>>>> Sure, I will enable debug and retry.
>>>>>
>>>>> In this case, restarting didn't fix the issue. I had to remove the
>>>>> journal and snapshot and then restart.
>>>>>
>>>>> On Fri, Apr 7, 2017 at 1:16 PM, Tom Pantelis <tompante...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> That was for "Force install snapshot when follower log is ahead",
>>>>>> which is different from your case. I don't know the sequence that led
>>>>>> up to your issue. I have seen this before - I don't remember all the
>>>>>> details but, from what I recall, it's caused by the journal indexing
>>>>>> getting screwed up at some point earlier during restarts and leader
>>>>>> changes. As I mentioned, I think it may be fixed in Carbon but I don't
>>>>>> recall what patch or patches. It can be very tedious debugging such
>>>>>> issues, and usually it's because of something that happened earlier,
>>>>>> so it really helps to have debug logging for
>>>>>> org.opendaylight.controller.cluster.datastore.Shard enabled so
>>>>>> there's a paper trail.
>>>>>>
>>>>>> On Fri, Apr 7, 2017 at 3:05 PM, Srini Seetharaman <
>>>>>> srini.seethara...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Tom,
>>>>>>> I was referring to https://git.opendaylight.org/gerrit/#/c/41323/
>>>>>>> which was included as early as Beryllium. I am using the latest
>>>>>>> Boron-SR3 codebase and it is happening.
>>>>>>>
>>>>>>> Srini.
>>>>>>>
>>>>>>> On Fri, Apr 7, 2017 at 11:53 AM, Tom Pantelis <tompante...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Srini,
>>>>>>>>
>>>>>>>> Which patch are you referring to? Forcing a snapshot isn't
>>>>>>>> something you enable - it does it based on certain conditions. I
>>>>>>>> have seen that behavior before with the operational store because
>>>>>>>> it's not persistent. I think it's fixed in master, so can you try
>>>>>>>> the latest codebase and see if it reproduces?
>>>>>>>>
>>>>>>>> Tom
>>>>>>>>
>>>>>>>> On Fri, Apr 7, 2017 at 2:35 PM, Srini Seetharaman <
>>>>>>>> srini.seethara...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Tom,
>>>>>>>>> I see that you pushed a patch to force a snapshot install for a
>>>>>>>>> follower. I want to enable that in my setup because I frequently
>>>>>>>>> see cases where the follower prints the following after the
>>>>>>>>> cluster is recovered. It prints it forever and does not seem to
>>>>>>>>> recover.
>>>>>>>>>
>>>>>>>>> 2017-04-07 11:24:41,471 | INFO  | lt-dispatcher-20 | Shard
>>>>>>>>>                    | 187 - org.opendaylight.controller.sal-akka-raft
>>>>>>>>> - 1.4.3.Boron-SR3 | member-2-shard-default-operational
>>>>>>>>> (Follower): The log is not empty but the prevLogIndex 1685 was not
>>>>>>>>> found in it - lastIndex: 1685, snapshotIndex: -1
>>>>>>>>> 2017-04-07 11:24:41,471 | INFO  | lt-dispatcher-20 | Shard
>>>>>>>>>                    | 187 - org.opendaylight.controller.sal-akka-raft
>>>>>>>>> - 1.4.3.Boron-SR3 | member-2-shard-default-operational
>>>>>>>>> (Follower): Follower is out-of-sync so sending negative reply:
>>>>>>>>> AppendEntriesReply [term=23, success=false, 
>>>>>>>>> followerId=member-2-shard-default-operational,
>>>>>>>>> logLastIndex=1685, logLastTerm=9, forceInstallSnapshot=false,
>>>>>>>>> payloadVersion=5, raftVersion=3]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Srini.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
_______________________________________________
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev
