Thanks for answering the queries.

---
Mallikarjun


On Wed, Aug 18, 2021 at 9:32 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:

> In hbase, the mvcc write number and wal sequence id are the same
> thing, so when we just want to bump the mvcc number, we will not write
> an actual WAL but the sequence id will be increased.
>
> And I think it will be good to have a MR job to replicate WAL serially.
>
> Mallikarjun <mallik.v.ar...@gmail.com> wrote on Wed, Aug 18, 2021 at 10:50 AM:
> >
> > Inline Reply
> >
> > On Wed, Aug 18, 2021 at 8:06 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> >
> > > Mallikarjun <mallik.v.ar...@gmail.com> wrote on Wed, Aug 18, 2021 at 10:19 AM:
> > > >
> > > > Thanks for the response @Duo
> > > >
> > > > Inline reply
> > > >
> > > > > On Wed, Aug 18, 2021 at 7:37 AM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> > > >
> > > > > This is the isRangeFinished method
> > > > >
> > > > > >   private boolean isRangeFinished(long endBarrier, String encodedRegionName)
> > > > > >       throws IOException {
> > > > > >     long pushedSeqId;
> > > > > >     try {
> > > > > >       pushedSeqId = storage.getLastSequenceId(encodedRegionName, peerId);
> > > > > >     } catch (ReplicationException e) {
> > > > > >       throw new IOException(
> > > > > >         "Failed to get pushed sequence id for " + encodedRegionName +
> > > > > >           ", peer " + peerId, e);
> > > > > >     }
> > > > > >     // endBarrier is the open sequence number. When opening a region,
> > > > > >     // the open sequence number will be set to the old max sequence id
> > > > > >     // plus one, so here we need to minus one.
> > > > > >     return pushedSeqId >= endBarrier - 1;
> > > > > >   }
> > > > >
> > > > > So for this region
> > > > >
> > > > > > rs-9 24c765b42253f96b550831d83e99cc9e 18775105 18762209
> > > > > > [17776286, +++ 18762210, 18775053, 18775079, 18775104, -- 18775119]
> > > > >
> > > > > We have already finished the first range [17776286, 18762210), but
> > > > > then we jump directly to range [18775053, 18775079), so the problem
> > > > > here is where is the [18762210, 18775053)...
> > > > >
> > > >
> > > > > These sequence IDs are present in WALs which are not cleaned up (in
> > > > > OLDWALs).
> > > > >
> > > > > Related question: Is it allowed to have gaps in sequence IDs in WALs
> > > > > for a single region?
> > > > > Example: for region *24c765b42253f96b550831d83e99cc9e*, if sequence ID
> > > > > *18775105* is present, must *18775106* also be present, or can there
> > > > > be gaps?
> > > Inside a range it is allowed to have gaps, but when reopening a
> > > region, we need to make sure there are no gaps otherwise the
> > > replication will be stuck.
> > >
> >
> > Just curious to know, what are some scenarios which can lead to gaps? In
> > my small number of experiments the sequence IDs were consecutive; I did
> > not find any such gap scenarios.
> >
> >
> > > >
> > > >
> > > > >
> > > > > > As for the fix, you can clear the range information for the given
> > > > > > regions in the meta table and then restart the clusters; I think the
> > > > > > replication could then continue.
> > > > >
> > > > >
> > > > > If you mean removing some barriers so that replication is unblocked,
> > > > > doesn't that lead to *out of order events* being replicated and end
> > > > > up corrupting data?
> > > Yes, the replication will be out of order. But this is the easier way
> > > to recover the replication.
> > > If you still want to preserve the order, then you need to find the
> > > answer to my question: where is the WAL for the missing ranges? Is
> > > it because we have already replicated the data but did not mark the
> > > range as finished, or did we just lose the WAL data for the range?
> > >
> >
> > The current scenario occurred in an active-passive cluster setup, and
> > replication is stuck on the passive side. So I won't be able to answer the
> > following question:
> >
> > we have already replicated the data
> >
> > But the following comment can help me check in oldWAL whether the last
> > sequence id before the region movement is present or not:
> >
> > but when reopening a region, we need to make sure there are no gaps
> >
> > Alternatively, do you think it is a good idea to write a job similar to
> > *RecoveredReplicationSource* which ensures serial replication to the other
> > cluster outside of the hbase cluster?
> >
> >
> > > >
> > > >
> > > > > > Mallikarjun <mallik.v.ar...@gmail.com> wrote on Tue, Aug 17, 2021 at 3:04 PM:
> > > > > >
> > > > > > > I have got into the following scenario. I won't go into details of
> > > > > > > how I got here, since I am not able to reliably reproduce this
> > > > > > > scenario thus far. (It typically happens when some rs goes down
> > > > > > > because of hardware issues.)
> > > > > > >
> > > > > > > Let me explain the following details.
> > > > > > > Col 1: Region server on which the region is trying to replicate
> > > > > > > Col 2: Region trying to replicate but stuck
> > > > > > > Col 3: Sequence ID which is being replicated and is stuck because
> > > > > > > the previous range is not finished
> > > > > > > Col 4: Checkpoint in zk until which sequence IDs are already
> > > > > > > replicated to the peer
> > > > > > > Col 5: Replication barriers for that region. This is a list of open
> > > > > > > sequence IDs on region movement. (+++ marks where the *checkpoint*
> > > > > > > belongs, --- marks where the *to replicate seqid* belongs)
> > > > > >
> > > > > > There are in total 53 regions and 10 regionservers
> > > > > >
> > > > > > > RegionServer | Region | Trying to replicate sequenceID | Replicated until | Current Barriers
> > > > > > > rs-9 | 24c765b42253f96b550831d83e99cc9e | 18775105 | 18762209 | [17776286, +++ 18762210, 18775053, 18775079, 18775104, -- 18775119]
> > > > > > > rs-5 | b4144bfe75c5826710ec54849741b038 | 189154192 | 189091221 | [184183678, +++ 189117430, 189154191, -- 189154327]
> > > > > > > rs-8 | deb6fee3380e7b9db9826cb5f27f8a59 | 189099509 | 189036510 | [180662218, +++ 189062798, 189099508, -- 189099587]
> > > > > > > rs-8 | 3338fd34ae7ba06a7eccd89048fa83ce | 189078951 | 189077722 | [184170310, +++ 189078876, 189078950, -- 189104780, 189141509, 189141545, 189141595]
> > > > > > > rs-6 | 1af22c68b9212971ab2570e14b7b0dc2 | 183301002 | 183265047 | [180239864, +++ 183265048, 183270357, 183277363, 183300886, 183301001, -- 183301062]
> > > > > > > rs-10 | 1af22c68b9212971ab2570e14b7b0dc2 | 183301063 | 183265047 | [180239864, +++ 183265048, 183270357, 183277363, 183300886, 183301001, 183301062 --]
> > > > > > > rs-6 | 4b9e98c7eca7a24c74136de1aa8aeab0 | 189027036 | 189022619 | [189022618, +++ 189027035, 189085155, 189085241, 189085290]
> > > > > > > rs-4 | e45ba292df95edbdf884e2ec50cf5f16 | 189099081 | 189062191 | [184126535, +++ 189098947, 189099080, -- 189099226]
> > > > > > > rs-4 | 83e65729dcad644738a0a3cee994e2df | 189012454 | 189012365 | [184103269, +++ 189012453, -- 189012538, 189074967, 189075016, 189075294, 189075349]
> > > > > > > rs-10 | 83e65729dcad644738a0a3cee994e2df | 189012539 | 189012365 | [184103269, +++ 189012453, 189012538, -- 189074967, 189075016, 189075294, 189075349]
> > > > > > > rs-3 | 11fca95de4878782af53371a25cf44d0 | 189121426 | 189058129 | [180684344, +++ 189084916, 189121283, 189121425, -- 189121602]
> > > > > > > rs-3 | b9db001578e127740d7e0e186e4fbab6 | 189145458 | 189081436 | [184175242, +++ 189083026, 189145417, 189145457, -- 189145562, 189145723, 189145781]
> > > > > > > rs-2 | 262ca9ff7b878f32c451fac3eb430a88 | 189128535 | 189065879 | [184159187, +++ 189091684, 189128534, -- 189128708]
> > > > > > > rs-2 | 03a1eb906a344944aad727dbb8210cfc | 172392082 | 172390331 | [167737983, +++ 172392081, -- 172400093, 172446121, 172446172]
> > > > > > > rs-10 | ae2726c7b4eeec3f93336d71e80145a4 | 189027430 | 189026939 | [184119428, +++ 189027429, -- 189053118, 189089933, 189089995, 189090059]
> > > > > > > rs-10 | 770ba4f4568fff803e6df340b2ffe486 | 189034144 | 189032879 | [184127026, +++ 189034143, -- 189048295, 189059834, 189096413, 189096513, 189096548, 189096606]
> > > > > > > rs-1 | 5846f4ce8acdd5aabf325c847d18c729 | 18793501 | 18780639 | [18778549, +++ 18784783, 18793471, 18793484, 18793500 --]
> > > > > > > rs-1 | 5846f4ce8acdd5aabf325c847d18c729 | 18793472 | 18780639 | [18778549, +++ 18784783, 18793471, --- 18793484, 18793500]
> > > > > > > rs-1 | fabd3ea591d5f20a86a26f8767d34f63 | 189028498 | 189024357 | [184116531, +++ 189025318, 189028497, --- 189051176, 189087488, 189087737, 189087850]
> > > > > > > rs-1 | 335d855c5005343719ea73bcb7dcb269 | 189064849 | 189037338 | [184130122, +++ 189064848, --- 189101485, 189101698, 189101774]
> > > > > >
> > > > > >
> > > > > > My question is, how do I recover from here? Any suggestions.
> > > > > >
> > > > > > > My only thought is that I have to replay by writing some MR jobs or
> > > > > > > scripts to read and replay selectively and update checkpoints.
> > > > > >
> > > > > > ---
> > > > > > Mallikarjun
> > > > >
> > >
>
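
For readers following the barrier discussion in the thread, here is a minimal sketch of the range check applied to the rs-9 row from the original report. It is not HBase code: the class name BarrierRangeSketch and the helper rangeIndex are hypothetical, and isRangeFinished here takes the pushed sequence id as a parameter instead of reading it from replication storage. Only the comparison itself is taken from the method quoted above.

```java
import java.util.Arrays;

// Hypothetical, standalone illustration; none of these names are HBase APIs.
public class BarrierRangeSketch {

  // Same comparison as the quoted isRangeFinished: the range ending at
  // endBarrier is finished once everything up to endBarrier - 1 is pushed.
  static boolean isRangeFinished(long endBarrier, long pushedSeqId) {
    return pushedSeqId >= endBarrier - 1;
  }

  // Returns i such that barriers[i] <= seqId < barriers[i + 1] (or the last
  // index if seqId is at or past the last barrier), -1 if below the first.
  // Assumes barriers is sorted ascending.
  static int rangeIndex(long[] barriers, long seqId) {
    int pos = Arrays.binarySearch(barriers, seqId);
    return pos >= 0 ? pos : -pos - 2;
  }

  public static void main(String[] args) {
    // Values copied from the rs-9 row for region 24c765b42253f96b550831d83e99cc9e.
    long[] barriers = {17776286L, 18762210L, 18775053L, 18775079L, 18775104L, 18775119L};
    long pushedSeqId = 18762209L;  // "Replicated until" column
    long toReplicate = 18775105L;  // "Trying to replicate sequenceID" column

    int current = rangeIndex(barriers, toReplicate);
    System.out.println("seq id " + toReplicate + " is in the range starting at "
        + barriers[current]);

    // Every earlier range must be finished before this one may be pushed.
    for (int i = 1; i <= current; i++) {
      System.out.println("range [" + barriers[i - 1] + ", " + barriers[i] + ") finished: "
          + isRangeFinished(barriers[i], pushedSeqId));
    }
  }
}
```

With these numbers only the first range, [17776286, 18762210), passes the check. The ranges between 18762210 and 18775104 do not, which is consistent with the stuck state described in the thread.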
