Re: Stuck Serial replication -- Need suggestions on recovery

Mallikarjun Tue, 17 Aug 2021 19:50:20 -0700

Inline Reply

On Wed, Aug 18, 2021 at 8:06 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:


> Mallikarjun <mallik.v.ar...@gmail.com> wrote on Wed, Aug 18, 2021 at 10:19 AM:
> >
> > Thanks for the response @Duo
> >
> > Inline reply
> >
> > On Wed, Aug 18, 2021 at 7:37 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> >
> > > This is the isRangeFinished method
> > >
> > >   private boolean isRangeFinished(long endBarrier, String encodedRegionName)
> > >       throws IOException {
> > >     long pushedSeqId;
> > >     try {
> > >       pushedSeqId = storage.getLastSequenceId(encodedRegionName, peerId);
> > >     } catch (ReplicationException e) {
> > >       throw new IOException(
> > >         "Failed to get pushed sequence id for " + encodedRegionName + ", peer " + peerId, e);
> > >     }
> > >     // endBarrier is the open sequence number. When opening a region, the open
> > >     // sequence number will be set to the old max sequence id plus one, so here
> > >     // we need to minus one.
> > >     return pushedSeqId >= endBarrier - 1;
> > >   }
> > >
> > > So for this region
> > >
> > > rs-9 24c765b42253f96b550831d83e99cc9e 18775105 18762209 [17776286, +++ 18762210, 18775053, 18775079, 18775104, -- 18775119]
> > >
> > > We have already finished the first range [17776286, 18762210), but
> > > then we jump directly to range [18775053, 18775079), so the problem
> > > here is where is the [18762210, 18775053)...
> > >
> >
> > These sequence IDs are present in WALs that have not been cleaned up yet
> > (they are in oldWALs).
> >
> > Related question: are gaps allowed in the sequence IDs in the WALs for a
> > single region?
> > Example: for region *24c765b42253f96b550831d83e99cc9e*, if sequence ID
> > *18775105* is present, must *18775106* also be present, or can there be
> > gaps?
> Inside a range it is allowed to have gaps, but when reopening a
> region, we need to make sure there are no gaps otherwise the
> replication will be stuck.
>

Just curious: what are some scenarios that can lead to gaps? In the few
experiments I ran, the sequence IDs were always consecutive; I did not come
across any gaps.
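
As a rough illustration of the range check quoted above, here is a minimal
standalone sketch applied to the rs-9 row. It is a simplification and not the
HBase implementation: the names SerialRangeSketch, rangeIndex and canPush are
made up for this example, the pushed sequence ID is passed in directly instead
of being read from replication storage, and any parent-region or
region-opening checks are left out.

```java
import java.util.Arrays;

// Simplified, standalone illustration; not the HBase implementation.
final class SerialRangeSketch {

  // Mirrors the quoted isRangeFinished: the range ending at endBarrier is
  // finished once everything up to endBarrier - 1 has been pushed.
  static boolean isRangeFinished(long endBarrier, long pushedSeqId) {
    return pushedSeqId >= endBarrier - 1;
  }

  // Returns the 1-based index of the range [barriers[i-1], barriers[i]) that
  // contains seqId, or 0 if seqId lies before the first barrier.
  // barriers must be sorted in ascending order.
  static int rangeIndex(long[] barriers, long seqId) {
    int pos = Arrays.binarySearch(barriers, seqId);
    return pos >= 0 ? pos + 1 : -pos - 1;
  }

  // Assumption for this sketch: an entry may be pushed only when the range
  // before the one it belongs to is finished.
  static boolean canPush(long[] barriers, long pushedSeqId, long seqId) {
    int index = rangeIndex(barriers, seqId);
    if (index <= 1) {
      return true; // before the first barrier or in the first range
    }
    // barriers[index - 1] starts the current range, so it is the end barrier
    // of the previous range.
    return isRangeFinished(barriers[index - 1], pushedSeqId);
  }
}
```

With the rs-9 barriers and a checkpoint of 18762209,
canPush(barriers, 18762209L, 18775105L) returns false: the previous range
[18775079, 18775104) would need the checkpoint to reach 18775103.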


> >
> >
> > >
> > > And on the fix, you can clear the range information for the given
> > > regions in meta table, and then restart the clusters, I think the
> > > replication could continue.
> > >
> > >
> > If you mean removing some barriers so that replication is unblocked,
> > doesn't that lead to *out-of-order events* being replicated, which ends up
> > corrupting data?
> Yes, the replication will be out of order. But this is the easier way
> to recover the replication.
> If you still want to obtain the order, then you need to find out the
> root cause of my question, where is the WAL for the missing ranges. Is
> it because we have already replicated the data but do not mark the
> range as finished, or we just lose the WAL data for the range?
>

The current scenario occurred in an active-passive cluster setup, and
replication is stuck on the passive side, so I won't be able to answer the
following question:

> we have already replicated the data

But the following comment can help me check in oldWALs whether the last
sequence ID before the region movement is present:

> but when reopening a region, we need to make sure there are no gaps
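
One way to turn a barrier list and a checkpoint into the sequence-ID ranges
worth looking for in oldWALs is sketched below. This is an illustrative helper
and not part of HBase: PendingRangesSketch and pendingRanges are hypothetical
names, and the sketch relies on the quoted code comment that the open sequence
number is the old max sequence ID plus one, so the last ID expected before
each region move is end - 1. The open-ended range after the last barrier is
not included.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper; not part of HBase.
final class PendingRangesSketch {

  // Returns every [start, end) barrier range whose last expected sequence ID
  // (end - 1) lies beyond the checkpoint. For a range that is partly pushed,
  // start is moved up to the first ID after the checkpoint.
  // barriers must be sorted in ascending order.
  static List<long[]> pendingRanges(long[] barriers, long pushedSeqId) {
    List<long[]> pending = new ArrayList<>();
    for (int i = 0; i + 1 < barriers.length; i++) {
      long start = barriers[i];
      long end = barriers[i + 1];
      if (pushedSeqId < end - 1) {
        pending.add(new long[] {Math.max(start, pushedSeqId + 1), end});
      }
    }
    return pending;
  }
}
```

For the rs-9 row (checkpoint 18762209) the first range returned is
[18762210, 18775053), which is the range asked about earlier in the thread.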

Alternatively, do you think it is a good idea to write a job, similar to
*RecoveredReplicationSource*, that ensures serial replication to the other
cluster from outside the HBase cluster?


> >
> >
> > > Mallikarjun <mallik.v.ar...@gmail.com> wrote on Tue, Aug 17, 2021 at 3:04 PM:
> > > >
> > > > I have got into the following scenario. I won't go into the details of
> > > > how I got here, since I have not been able to reproduce it reliably so
> > > > far. (It typically happens when some RS goes down because of hardware
> > > > issues.)
> > > >
> > > > Here is what each column below means:
> > > > Col 1: Region server on which the region is trying to replicate
> > > > Col 2: Region that is trying to replicate but is stuck
> > > > Col 3: Sequence ID that is being replicated and is stuck because the
> > > > previous range is not finished
> > > > Col 4: Checkpoint in ZK up to which sequence IDs have already been
> > > > replicated to the peer
> > > > Col 5: Replication barriers for that region. This is a list of open
> > > > sequence IDs recorded on region movement. (+++ marks where the
> > > > *checkpoint* falls, --- marks where the *to-replicate seqid* falls)
> > > >
> > > > There are in total 53 regions and 10 regionservers
> > > >
> > > > RegionServer | Region | Trying to replicate sequenceID | Replicated until | Current Barriers
> > > > rs-9 | 24c765b42253f96b550831d83e99cc9e | 18775105 | 18762209 | [17776286, +++ 18762210, 18775053, 18775079, 18775104, -- 18775119]
> > > > rs-5 | b4144bfe75c5826710ec54849741b038 | 189154192 | 189091221 | [184183678, +++ 189117430, 189154191, -- 189154327]
> > > > rs-8 | deb6fee3380e7b9db9826cb5f27f8a59 | 189099509 | 189036510 | [180662218, +++ 189062798, 189099508, -- 189099587]
> > > > rs-8 | 3338fd34ae7ba06a7eccd89048fa83ce | 189078951 | 189077722 | [184170310, +++ 189078876, 189078950, -- 189104780, 189141509, 189141545, 189141595]
> > > > rs-6 | 1af22c68b9212971ab2570e14b7b0dc2 | 183301002 | 183265047 | [180239864, +++ 183265048, 183270357, 183277363, 183300886, 183301001, -- 183301062]
> > > > rs-10 | 1af22c68b9212971ab2570e14b7b0dc2 | 183301063 | 183265047 | [180239864, +++ 183265048, 183270357, 183277363, 183300886, 183301001, 183301062 --]
> > > > rs-6 | 4b9e98c7eca7a24c74136de1aa8aeab0 | 189027036 | 189022619 | [189022618, +++ 189027035, 189085155, 189085241, 189085290]
> > > > rs-4 | e45ba292df95edbdf884e2ec50cf5f16 | 189099081 | 189062191 | [184126535, +++ 189098947, 189099080, -- 189099226]
> > > > rs-4 | 83e65729dcad644738a0a3cee994e2df | 189012454 | 189012365 | [184103269, +++ 189012453, -- 189012538, 189074967, 189075016, 189075294, 189075349]
> > > > rs-10 | 83e65729dcad644738a0a3cee994e2df | 189012539 | 189012365 | [184103269, +++ 189012453, 189012538, -- 189074967, 189075016, 189075294, 189075349]
> > > > rs-3 | 11fca95de4878782af53371a25cf44d0 | 189121426 | 189058129 | [180684344, +++ 189084916, 189121283, 189121425, -- 189121602]
> > > > rs-3 | b9db001578e127740d7e0e186e4fbab6 | 189145458 | 189081436 | [184175242, +++ 189083026, 189145417, 189145457, -- 189145562, 189145723, 189145781]
> > > > rs-2 | 262ca9ff7b878f32c451fac3eb430a88 | 189128535 | 189065879 | [184159187, +++ 189091684, 189128534, -- 189128708]
> > > > rs-2 | 03a1eb906a344944aad727dbb8210cfc | 172392082 | 172390331 | [167737983, +++ 172392081, -- 172400093, 172446121, 172446172]
> > > > rs-10 | ae2726c7b4eeec3f93336d71e80145a4 | 189027430 | 189026939 | [184119428, +++ 189027429, -- 189053118, 189089933, 189089995, 189090059]
> > > > rs-10 | 770ba4f4568fff803e6df340b2ffe486 | 189034144 | 189032879 | [184127026, +++ 189034143, -- 189048295, 189059834, 189096413, 189096513, 189096548, 189096606]
> > > > rs-1 | 5846f4ce8acdd5aabf325c847d18c729 | 18793501 | 18780639 | [18778549, +++ 18784783, 18793471, 18793484, 18793500 --]
> > > > rs-1 | 5846f4ce8acdd5aabf325c847d18c729 | 18793472 | 18780639 | [18778549, +++ 18784783, 18793471, --- 18793484, 18793500]
> > > > rs-1 | fabd3ea591d5f20a86a26f8767d34f63 | 189028498 | 189024357 | [184116531, +++ 189025318, 189028497, --- 189051176, 189087488, 189087737, 189087850]
> > > > rs-1 | 335d855c5005343719ea73bcb7dcb269 | 189064849 | 189037338 | [184130122, +++ 189064848, --- 189101485, 189101698, 189101774]
> > > >
> > > >
> > > > My question is: how do I recover from here? Any suggestions?
> > > >
> > > > My only thought is that I would have to replay the data by writing some
> > > > MR jobs or scripts that read and replay selectively and update the
> > > > checkpoints.
> > > >
> > > > ---
> > > > Mallikarjun
> > >
>
