Re: Stuck Serial replication -- Need suggestions on recovery

Mallikarjun Tue, 17 Aug 2021 19:18:56 -0700

Thanks for the response, @Duo.

Replies inline.


On Wed, Aug 18, 2021 at 7:37 AM 张铎(Duo Zhang) <[email protected]> wrote:

> This is the isRangeFinished method:
>
>   private boolean isRangeFinished(long endBarrier, String encodedRegionName)
>       throws IOException {
>     long pushedSeqId;
>     try {
>       pushedSeqId = storage.getLastSequenceId(encodedRegionName, peerId);
>     } catch (ReplicationException e) {
>       throw new IOException(
>         "Failed to get pushed sequence id for " + encodedRegionName + ", peer " + peerId, e);
>     }
>     // endBarrier is the open sequence number. When opening a region, the open
>     // sequence number will be set to the old max sequence id plus one, so here
>     // we need to minus one.
>     return pushedSeqId >= endBarrier - 1;
>   }
>
> So for this region
>
> rs-9 | 24c765b42253f96b550831d83e99cc9e | 18775105 | 18762209 | [17776286, +++ 18762210, 18775053, 18775079, 18775104, -- 18775119]
>
> We have already finished the first range [17776286, 18762210), but
> then we jump directly to the range [18775053, 18775079), so the question
> here is what happened to the range [18762210, 18775053)...
>
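
To make the range check concrete for this region, here is a minimal sketch that applies the same comparison as the quoted isRangeFinished to the barrier list above. It is a simplification and not the actual HBase checker: the class name BarrierRangeCheck and the helper rangeIndex are hypothetical, the pushed sequence ID is passed in directly instead of being read from replication storage, and the real checker may consult additional state.

```java
import java.util.Arrays;

// Hypothetical helper class, for illustration only (not part of HBase).
class BarrierRangeCheck {

  // Returns i such that barriers[i] <= seqId < barriers[i + 1], i.e. the index of
  // the barrier that opens the range containing seqId. Returns -1 if seqId is
  // smaller than the first barrier. The barriers array must be sorted ascending.
  static int rangeIndex(long[] barriers, long seqId) {
    int pos = Arrays.binarySearch(barriers, seqId);
    // On a miss, binarySearch returns -(insertionPoint) - 1.
    return pos >= 0 ? pos : -pos - 2;
  }

  // Same comparison as the quoted isRangeFinished, with pushedSeqId supplied
  // by the caller (assumption: it is the checkpoint from column 4).
  static boolean isRangeFinished(long endBarrier, long pushedSeqId) {
    return pushedSeqId >= endBarrier - 1;
  }

  public static void main(String[] args) {
    // Values for region 24c765b42253f96b550831d83e99cc9e from this thread.
    long[] barriers = {17776286L, 18762210L, 18775053L, 18775079L, 18775104L, 18775119L};
    long pushedSeqId = 18762209L;
    long toReplicate = 18775105L;

    int idx = rangeIndex(barriers, toReplicate);
    System.out.println("seqId " + toReplicate + " is in the range opened by barrier "
        + barriers[idx]);
    // The range before the current one ends at the barrier that opens the current one.
    System.out.println("previous range finished: "
        + isRangeFinished(barriers[idx], pushedSeqId));
    System.out.println("first range finished: "
        + isRangeFinished(barriers[1], pushedSeqId));
  }
}
```

With these values, 18775105 falls in [18775104, 18775119), the range ending at 18775104 is reported as not finished, and the first range [17776286, 18762210) is reported as finished, which matches the analysis above.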

These sequence IDs are present in WALs that have not been cleaned up yet
(they are in oldWALs).

Related question: is it allowed to have gaps in the sequence IDs in the WALs
for a single region?
Example: for region *24c765b42253f96b550831d83e99cc9e*, if sequence ID
*18775105* is present, must *18775106* also be present, or can there be gaps?
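
One way to check this empirically would be to scan the sequence IDs seen for the region and list the missing intervals. A minimal sketch, assuming the IDs have already been extracted from the WALs into a sorted array (the extraction step is not shown; the class name SeqIdGapFinder is hypothetical and the sample IDs in main are made up for illustration, not taken from the cluster):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only (not part of HBase).
class SeqIdGapFinder {

  // Returns every missing interval as {firstMissing, lastMissing}.
  // The input must be sorted in ascending order.
  static List<long[]> findGaps(long[] sortedSeqIds) {
    List<long[]> gaps = new ArrayList<>();
    for (int i = 1; i < sortedSeqIds.length; i++) {
      long prev = sortedSeqIds[i - 1];
      long cur = sortedSeqIds[i];
      if (cur - prev > 1) {
        gaps.add(new long[] {prev + 1, cur - 1});
      }
    }
    return gaps;
  }

  public static void main(String[] args) {
    // Made-up sample data: 18775107 and 18775108 are missing.
    long[] seqIds = {18775105L, 18775106L, 18775109L};
    for (long[] gap : findGaps(seqIds)) {
      System.out.println("missing " + gap[0] + " to " + gap[1]);
    }
  }
}
```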


>
> As for the fix, you can clear the range information for the given
> regions in the meta table and then restart the clusters; I think
> replication could then continue.
>
>
If you mean removing some barriers so that replication is unblocked:
wouldn't that lead to events being replicated *out of order* and end up
corrupting data?


> Mallikarjun <[email protected]> wrote on Tue, Aug 17, 2021 at 3:04 PM:
> >
> > I have run into the following scenario. I won't go into the details of how
> > I got here, since I have not been able to reproduce it reliably so far.
> > (It typically happens when some RS goes down because of hardware issues.)
> >
> > Let me explain the columns used below.
> > Col 1: Region server on which the region is trying to replicate
> > Col 2: Region that is trying to replicate but is stuck
> > Col 3: Sequence ID that is being replicated and is stuck because the
> > previous range is not finished
> > Col 4: Checkpoint in ZK up to which sequence IDs have already been
> > replicated to the peer
> > Col 5: Replication barriers for that region. This is a list of open
> > sequence IDs on region movement. (+++ marks where the *checkpoint* falls,
> > --- marks where the *to replicate seqid* falls)
> >
> > There are 53 regions and 10 region servers in total.
> >
> > RegionServer | Region | Trying to replicate sequenceID | Replicated until | Current Barriers
> > rs-9 | 24c765b42253f96b550831d83e99cc9e | 18775105 | 18762209 | [17776286, +++ 18762210, 18775053, 18775079, 18775104, -- 18775119]
> > rs-5 | b4144bfe75c5826710ec54849741b038 | 189154192 | 189091221 | [184183678, +++ 189117430, 189154191, -- 189154327]
> > rs-8 | deb6fee3380e7b9db9826cb5f27f8a59 | 189099509 | 189036510 | [180662218, +++ 189062798, 189099508, -- 189099587]
> > rs-8 | 3338fd34ae7ba06a7eccd89048fa83ce | 189078951 | 189077722 | [184170310, +++ 189078876, 189078950, -- 189104780, 189141509, 189141545, 189141595]
> > rs-6 | 1af22c68b9212971ab2570e14b7b0dc2 | 183301002 | 183265047 | [180239864, +++ 183265048, 183270357, 183277363, 183300886, 183301001, -- 183301062]
> > rs-10 | 1af22c68b9212971ab2570e14b7b0dc2 | 183301063 | 183265047 | [180239864, +++ 183265048, 183270357, 183277363, 183300886, 183301001, 183301062 --]
> > rs-6 | 4b9e98c7eca7a24c74136de1aa8aeab0 | 189027036 | 189022619 | [189022618, +++ 189027035, 189085155, 189085241, 189085290]
> > rs-4 | e45ba292df95edbdf884e2ec50cf5f16 | 189099081 | 189062191 | [184126535, +++ 189098947, 189099080, -- 189099226]
> > rs-4 | 83e65729dcad644738a0a3cee994e2df | 189012454 | 189012365 | [184103269, +++ 189012453, -- 189012538, 189074967, 189075016, 189075294, 189075349]
> > rs-10 | 83e65729dcad644738a0a3cee994e2df | 189012539 | 189012365 | [184103269, +++ 189012453, 189012538, -- 189074967, 189075016, 189075294, 189075349]
> > rs-3 | 11fca95de4878782af53371a25cf44d0 | 189121426 | 189058129 | [180684344, +++ 189084916, 189121283, 189121425, -- 189121602]
> > rs-3 | b9db001578e127740d7e0e186e4fbab6 | 189145458 | 189081436 | [184175242, +++ 189083026, 189145417, 189145457, -- 189145562, 189145723, 189145781]
> > rs-2 | 262ca9ff7b878f32c451fac3eb430a88 | 189128535 | 189065879 | [184159187, +++ 189091684, 189128534, -- 189128708]
> > rs-2 | 03a1eb906a344944aad727dbb8210cfc | 172392082 | 172390331 | [167737983, +++ 172392081, -- 172400093, 172446121, 172446172]
> > rs-10 | ae2726c7b4eeec3f93336d71e80145a4 | 189027430 | 189026939 | [184119428, +++ 189027429, -- 189053118, 189089933, 189089995, 189090059]
> > rs-10 | 770ba4f4568fff803e6df340b2ffe486 | 189034144 | 189032879 | [184127026, +++ 189034143, -- 189048295, 189059834, 189096413, 189096513, 189096548, 189096606]
> > rs-1 | 5846f4ce8acdd5aabf325c847d18c729 | 18793501 | 18780639 | [18778549, +++ 18784783, 18793471, 18793484, 18793500 --]
> > rs-1 | 5846f4ce8acdd5aabf325c847d18c729 | 18793472 | 18780639 | [18778549, +++ 18784783, 18793471, --- 18793484, 18793500]
> > rs-1 | fabd3ea591d5f20a86a26f8767d34f63 | 189028498 | 189024357 | [184116531, +++ 189025318, 189028497, --- 189051176, 189087488, 189087737, 189087850]
> > rs-1 | 335d855c5005343719ea73bcb7dcb269 | 189064849 | 189037338 | [184130122, +++ 189064848, --- 189101485, 189101698, 189101774]
> >
> >
> > My question is: how do I recover from here? Any suggestions?
> >
> > My only thought is that I would have to write some MR jobs or scripts to
> > read and replay the edits selectively, and then update the checkpoints.
> >
> > ---
> > Mallikarjun
>
