Filed https://bugs.opendaylight.org/show_bug.cgi?id=8199 to track this
issue.

On Fri, Apr 7, 2017 at 9:12 PM, Srini Seetharaman <
srini.seethara...@gmail.com> wrote:

> Ok. I also see that cold starting the instance allows it to catch up. But,
> that would cause it to flush out even the config store, which could've been
> in a good state.
>
> On Fri, Apr 7, 2017 at 6:33 PM, Tom Pantelis <tompante...@gmail.com>
> wrote:
>
>> I think you need to restart all nodes.
>>
>> On Fri, Apr 7, 2017 at 6:47 PM, Srini Seetharaman <
>> srini.seethara...@gmail.com> wrote:
>>
>>> Will do. Thanks for your help.
>>>
>>> Sorry to bother you this much. One last question on this topic - I'm
>>> trying to put in a temporary workaround to recover such instances that are
>>> stuck in a bad state. Is there a way to remove only the
>>> operational-shard (after which I will restart using the karaf.restart
>>> property)? My hope is that it will sync up after the reboot.
>>>
>>> On Fri, Apr 7, 2017 at 2:04 PM, Tom Pantelis <tompante...@gmail.com>
>>> wrote:
>>>
>>>> Also, when reproducing, please record the exact steps you did and the
>>>> approximate time.
>>>>
>>>> On Fri, Apr 7, 2017 at 5:01 PM, Tom Pantelis <tompante...@gmail.com>
>>>> wrote:
>>>>
>>>>> Also try it with Carbon as well if possible. If it doesn't reproduce
>>>>> there then that would at least verify it's fixed in Carbon then we could
>>>>> figure out which patch(es) to possibly back port to Boron.
>>>>>
>>>>>
>>>>> On Fri, Apr 7, 2017 at 4:27 PM, Srini Seetharaman <
>>>>> srini.seethara...@gmail.com> wrote:
>>>>>
>>>>>> Sure I will enable debug and retry.
>>>>>>
>>>>>> In this case, restarting didn't fix the issue. I had to remove the
>>>>>> journal and snapshot and then restart.
>>>>>>
>>>>>> On Fri, Apr 7, 2017 at 1:16 PM, Tom Pantelis <tompante...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> That was for "Force install snapshot when follower log is ahead"
>>>>>>> which is different than your case. I don't know the sequence that led
>>>>>>> up to your issue. I have seen this before - I don't remember all the
>>>>>>> details but, from what I recall, it's caused by the journal indexing
>>>>>>> getting screwed up at some point earlier during restarts and leader
>>>>>>> changes. As I mentioned, I think it may be fixed in Carbon but I don't
>>>>>>> recall what patch or patches. It can be very tedious debugging such
>>>>>>> issues and usually it's because of something that happened earlier so
>>>>>>> it really helps to have
>>>>>>> org.opendaylight.controller.cluster.datastore.Shard enabled so there's
>>>>>>> a paper trail.
>>>>>>>
>>>>>>> On Fri, Apr 7, 2017 at 3:05 PM, Srini Seetharaman <
>>>>>>> srini.seethara...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Tom
>>>>>>>> I was referring to https://git.opendaylight.org/gerrit/#/c/41323/
>>>>>>>> that was included as early as Beryllium. I am using the latest
>>>>>>>> Boron-SR3 codebase and it is happening.
>>>>>>>>
>>>>>>>> Srini.
>>>>>>>>
>>>>>>>> On Fri, Apr 7, 2017 at 11:53 AM, Tom Pantelis <
>>>>>>>> tompante...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Srini,
>>>>>>>>>
>>>>>>>>> Which patch are you referring to? Forcing a snapshot isn't
>>>>>>>>> something you enable - it does it based on certain conditions. I
>>>>>>>>> have seen that behavior before with the operational store b/c it's
>>>>>>>>> not persistent. I think it's fixed in master so can you try the
>>>>>>>>> latest codebase and see if it reproduces?
>>>>>>>>>
>>>>>>>>> Tom
>>>>>>>>>
>>>>>>>>> On Fri, Apr 7, 2017 at 2:35 PM, Srini Seetharaman <
>>>>>>>>> srini.seethara...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Tom
>>>>>>>>>> I see that you pushed a patch to force a snapshot install for a
>>>>>>>>>> follower. I want to enable that in my setup because I frequently
>>>>>>>>>> see cases where the follower prints the following after cluster is
>>>>>>>>>> recovered. It prints it forever and does not seem to recover.
>>>>>>>>>>
>>>>>>>>>> 2017-04-07 11:24:41,471 | INFO  | lt-dispatcher-20 | Shard | 187 - org.opendaylight.controller.sal-akka-raft - 1.4.3.Boron-SR3 | member-2-shard-default-operational (Follower): The log is not empty but the prevLogIndex 1685 was not found in it - lastIndex: 1685, snapshotIndex: -1
>>>>>>>>>> 2017-04-07 11:24:41,471 | INFO  | lt-dispatcher-20 | Shard | 187 - org.opendaylight.controller.sal-akka-raft - 1.4.3.Boron-SR3 | member-2-shard-default-operational (Follower): Follower is out-of-sync so sending negative reply: AppendEntriesReply [term=23, success=false, followerId=member-2-shard-default-operational, logLastIndex=1685, logLastTerm=9, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Srini.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
_______________________________________________
controller-dev mailing list
controller-dev@lists.opendaylight.org
https://lists.opendaylight.org/mailman/listinfo/controller-dev
