Thanks for the response, @Duo. Replies inline.

On Wed, Aug 18, 2021 at 7:37 AM 张铎(Duo Zhang) <[email protected]> wrote:

> This is the isRangeFinished method
>
>   private boolean isRangeFinished(long endBarrier, String encodedRegionName) throws IOException {
>     long pushedSeqId;
>     try {
>       pushedSeqId = storage.getLastSequenceId(encodedRegionName, peerId);
>     } catch (ReplicationException e) {
>       throw new IOException(
>         "Failed to get pushed sequence id for " + encodedRegionName + ", peer " + peerId, e);
>     }
>     // endBarrier is the open sequence number. When opening a region, the open sequence number will
>     // be set to the old max sequence id plus one, so here we need to minus one.
>     return pushedSeqId >= endBarrier - 1;
>   }
>
> So for this region
>
> rs-9 24c765b42253f96b550831d83e99cc9e 18775105 18762209 [17776286, +++
> 18762210, 18775053, 18775079, 18775104, -- 18775119]
>
> We have already finished the first range [17776286, 18762210), but
> then we jump directly to range [18775053, 18775079), so the problem
> here is where is the [18762210, 18775053)...
>

These sequence IDs are present in WALs that have not been cleaned up (they are in OLDWALs).

Related question: are gaps allowed in the sequence IDs in the WALs for a single region? For example, for region *24c765b42253f96b550831d83e99cc9e*, if sequence ID *18775105* is present, must *18775106* also be present, or can there be gaps?

> And on the fix, you can clear the range information for the given
> regions in meta table, and then restart the clusters, I think the
> replication could continue.
>

If you mean removing some barriers so that replication is unblocked, wouldn't that lead to *out-of-order events* being replicated and end up corrupting data?

> Mallikarjun <[email protected]> wrote on Tue, Aug 17, 2021 at 3:04 PM:
> >
> > I have got into the following scenario. I won't go into details of how I
> > got here, since I am not able to reliably reproduce this scenario thus far.
> > (Typically happens when some rs goes down because of hardware issues)
> >
> > Let me explain to you the following details.
> > Col 1: Region server on which region is trying to replicate
> > Col 2: Region trying to replicate but stuck
> > Col 3: SequenceID which is being replicated and stuck because previous
> > range is not finished
> > Col 4: Checkpoint in zk until which sequence id is already replicated to
> > peer
> > Col 5: Replication barriers for that region. This is a list of open
> > sequence IDs on region movement. (+++ means where *checkpoint* belongs,
> > --- is where *to replicate seqid* belongs)
> >
> > There are in total 53 regions and 10 regionservers
> >
> > RegionServer  Region                            Trying to replicate sequenceID  Replicated until  Current Barriers
> > rs-9          24c765b42253f96b550831d83e99cc9e  18775105                        18762209          [17776286, +++ 18762210, 18775053, 18775079, 18775104, -- 18775119]
> > rs-5          b4144bfe75c5826710ec54849741b038  189154192                       189091221         [184183678, +++ 189117430, 189154191, -- 189154327]
> > rs-8          deb6fee3380e7b9db9826cb5f27f8a59  189099509                       189036510         [180662218, +++ 189062798, 189099508, -- 189099587]
> > rs-8          3338fd34ae7ba06a7eccd89048fa83ce  189078951                       189077722         [184170310, +++ 189078876, 189078950, -- 189104780, 189141509, 189141545, 189141595]
> > rs-6          1af22c68b9212971ab2570e14b7b0dc2  183301002                       183265047         [180239864, +++ 183265048, 183270357, 183277363, 183300886, 183301001, -- 183301062]
> > rs-10         1af22c68b9212971ab2570e14b7b0dc2  183301063                       183265047         [180239864, +++ 183265048, 183270357, 183277363, 183300886, 183301001, 183301062 --]
> > rs-6          4b9e98c7eca7a24c74136de1aa8aeab0  189027036                       189022619         [189022618, +++ 189027035, 189085155, 189085241, 189085290]
> > rs-4          e45ba292df95edbdf884e2ec50cf5f16  189099081                       189062191         [184126535, +++ 189098947, 189099080, -- 189099226]
> > rs-4          83e65729dcad644738a0a3cee994e2df  189012454                       189012365         [184103269, +++ 189012453, -- 189012538, 189074967, 189075016, 189075294, 189075349]
> > rs-10         83e65729dcad644738a0a3cee994e2df  189012539                       189012365         [184103269, +++ 189012453, 189012538, -- 189074967, 189075016, 189075294, 189075349]
> > rs-3          11fca95de4878782af53371a25cf44d0  189121426                       189058129         [180684344, +++ 189084916, 189121283, 189121425, -- 189121602]
> > rs-3          b9db001578e127740d7e0e186e4fbab6  189145458                       189081436         [184175242, +++ 189083026, 189145417, 189145457, -- 189145562, 189145723, 189145781]
> > rs-2          262ca9ff7b878f32c451fac3eb430a88  189128535                       189065879         [184159187, +++ 189091684, 189128534, -- 189128708]
> > rs-2          03a1eb906a344944aad727dbb8210cfc  172392082                       172390331         [167737983, +++ 172392081, -- 172400093, 172446121, 172446172]
> > rs-10         ae2726c7b4eeec3f93336d71e80145a4  189027430                       189026939         [184119428, +++ 189027429, -- 189053118, 189089933, 189089995, 189090059]
> > rs-10         770ba4f4568fff803e6df340b2ffe486  189034144                       189032879         [184127026, +++ 189034143, -- 189048295, 189059834, 189096413, 189096513, 189096548, 189096606]
> > rs-1          5846f4ce8acdd5aabf325c847d18c729  18793501                        18780639          [18778549, +++ 18784783, 18793471, 18793484, 18793500 --]
> > rs-1          5846f4ce8acdd5aabf325c847d18c729  18793472                        18780639          [18778549, +++ 18784783, 18793471, --- 18793484, 18793500]
> > rs-1          fabd3ea591d5f20a86a26f8767d34f63  189028498                       189024357         [184116531, +++ 189025318, 189028497, --- 189051176, 189087488, 189087737, 189087850]
> > rs-1          335d855c5005343719ea73bcb7dcb269  189064849                       189037338         [184130122, +++ 189064848, --- 189101485, 189101698, 189101774]
> >
> >
> > My question is, how do I recover from here? Any suggestions.
> >
> > Only thought is that I have to replay by writing some MR jobs / some
> > scripts to read and replay selectively and update checkpoints.
> >
> > ---
> > Mallikarjun
>
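
To make the stuck condition concrete, here is a minimal, self-contained Java sketch of how the isRangeFinished comparison quoted above applies to the rs-9 row. It is an illustration only, not HBase code: the class name BarrierRangeSketch and the helpers rangeIndex and canPush are hypothetical, the storage lookup is replaced by a plain pushedSeqId parameter, and the real serial replication check likely consults more state than this.

```java
import java.util.Arrays;

// Illustration only; not part of HBase. All names except isRangeFinished are hypothetical.
final class BarrierRangeSketch {

  // Same comparison as the quoted method. Assumption: the pushed sequence id is
  // passed in directly instead of being read from replication storage.
  static boolean isRangeFinished(long endBarrier, long pushedSeqId) {
    // endBarrier is an open sequence number (old max sequence id plus one).
    return pushedSeqId >= endBarrier - 1;
  }

  // Index i such that seqId lies in [barriers[i], barriers[i + 1]);
  // returns -1 when seqId is below the first barrier. barriers must be sorted ascending.
  static int rangeIndex(long[] barriers, long seqId) {
    int pos = Arrays.binarySearch(barriers, seqId);
    // Not found: binarySearch returns -(insertionPoint) - 1, and the range
    // index is insertionPoint - 1.
    return pos >= 0 ? pos : -pos - 2;
  }

  // Hypothetical simplification: an entry may be pushed only when everything
  // before the start of its own range has already been pushed.
  static boolean canPush(long[] barriers, long seqId, long pushedSeqId) {
    int idx = rangeIndex(barriers, seqId);
    if (idx <= 0) {
      return true; // first range (or before any barrier): nothing earlier to wait for
    }
    return isRangeFinished(barriers[idx], pushedSeqId);
  }

  public static void main(String[] args) {
    // Values copied from the rs-9 row for region 24c765b42253f96b550831d83e99cc9e.
    long[] barriers = {17776286L, 18762210L, 18775053L, 18775079L, 18775104L, 18775119L};
    long checkpoint = 18762209L;

    System.out.println("range index of 18775105: " + rangeIndex(barriers, 18775105L));
    System.out.println("can push 18775105: " + canPush(barriers, 18775105L, checkpoint));
    System.out.println("can push 18762210: " + canPush(barriers, 18762210L, checkpoint));
  }
}
```

With the rs-9 values, entry 18775105 falls in [18775104, 18775119), and the check against barrier 18775104 fails because the checkpoint 18762209 is below 18775103. An entry at 18762210 would pass the same check, which is consistent with the missing range being [18762210, 18775053).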
