Re: Replication Lag Issue in HBase DR Cluster after Upgrade

Duo Zhang Sun, 20 Aug 2023 00:26:28 -0700

If it is just a metrics issue then HBASE-22784 won't help. I guess the
problem is that the replication lag is calculated by comparing the
current time and the time when we ship the last edit, so if there is
no new edit, the replication lag will keep growing.


Looking at the current code

https://github.com/apache/hbase/blob/dae078e5bc342012b49cd066027eb53ae9a21280/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/MetricsSource.java#L341

  public long getReplicationDelay() {
    if (getTimestampOfLastShippedOp() >= timeStampNextToReplicate) {
      return 0;
    } else {
      return EnvironmentEdgeManager.currentTime() - timeStampNextToReplicate;
    }
  }

It has an if condition that checks whether there are actual edits to
replicate, to avoid false alarms. This check was added by HBASE-21505.
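
To illustrate the difference, here is a minimal, self-contained sketch.
It is not HBase code: the class and method names are hypothetical, and
the "naive" form only reflects my guess above about how the lag could be
computed when no such check is present. The guarded form mirrors the
shape of the method quoted above, with the timestamps passed in as plain
arguments.

```java
// Illustrative sketch only; not taken from the HBase source tree.
// All names here are hypothetical.
public class ReplicationDelaySketch {

    // Naive form (an assumption): lag is "now minus the time the last
    // edit was shipped". On an idle source nothing new is shipped, so
    // the reported value keeps growing with wall-clock time.
    public static long naiveDelay(long nowMs, long lastShippedMs) {
        return nowMs - lastShippedMs;
    }

    // Guarded form, following the shape of the check quoted above:
    // if the last shipped edit is at least as new as the next edit
    // waiting to be replicated, nothing is pending and the lag is 0.
    public static long guardedDelay(long nowMs, long lastShippedMs,
                                    long nextToReplicateMs) {
        if (lastShippedMs >= nextToReplicateMs) {
            return 0;
        }
        return nowMs - nextToReplicateMs;
    }

    public static void main(String[] args) {
        long lastShipped = 1_000L;
        long nextToReplicate = 1_000L; // no newer edit is queued
        // Advance the clock on an idle source and compare both forms.
        for (long now = 2_000L; now <= 4_000L; now += 1_000L) {
            System.out.println("now=" + now
                + " naive=" + naiveDelay(now, lastShipped)
                + " guarded=" + guardedDelay(now, lastShipped, nextToReplicate));
        }
    }
}
```

With no pending edits, the naive value rises on every iteration while the
guarded value stays at 0, which matches the symptom described in this
thread.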

The code for branch-1.4 is completely different, and since all HBase
1.x versions have been EOL for quite some time, I'm not sure what the
easiest way to fix the problem is. You may need to read the code a bit
more carefully to see how to add the above check to the 1.x code line.

Thanks.

Valli <kmanimeka...@gmail.com> wrote on Thu, 17 Aug 2023 at 23:14:
>
> Hi Duo Zhang
>
> It's just metrics. There is no active write in that cluster, so we
> don't have any data to replicate to the other cluster.
>
>
> On Wed, 16 Aug 2023 at 08:01, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>
> > Is this just a metrics issue or is there an actual replication lag?
> >
> > Valli <kmanimeka...@gmail.com> wrote on Fri, 11 Aug 2023 at 22:51:
> > >
> > > Hello HBase Community,
> > >
> > > We recently upgraded our HBase cluster from version 1.2.6 to 1.4.14 and
> > > have encountered an issue with replication lag in our Disaster Recovery
> > > (DR) cluster. We have two clusters in our setup: an active write cluster
> > > and a DR cluster that receives replication from the active cluster. The
> > > replication lag in the DR cluster has been building up, even though there
> > > are no direct writes to it.
> > >
> > > Here's a brief overview of the problem:
> > > - We have an active write cluster with no replication lag.
> > > - The DR cluster only receives replication from the active cluster and
> > > doesn't have direct writes.
> > > - Replication lag builds up in the DR cluster over time, even though
> > > there is no active write.
> > > - When a 'put' call is made in the DR cluster, the replication lag
> > > reduces momentarily, but then starts building up again.
> > >
> > > We experienced a similar issue on version 1.4.9 in another
> > > cluster and used the patch below for it.
> > >
> > > https://issues.apache.org/jira/browse/HBASE-22784
> > >
> > > Version 1.4.14 already contains this patch, but we still experience
> > > the issue.
> > >
> > > Are there any specific configurations or adjustments we should be
> > > making to address this problem? It's important for us to maintain a
> > > reliable DR setup, and any guidance or insights you can provide would
> > > be greatly appreciated.
> > >
> > > If anyone has experienced a similar issue after upgrading HBase or has
> > > any recommendations on how to troubleshoot and resolve replication lag
> > > in a DR cluster, please share your thoughts.
> > >
> > > Thank you in advance for your time and assistance. Your expertise and
> > > insights are invaluable to us as we work to resolve this issue and
> > > maintain the stability of our HBase setup.
> > >
> > > Best regards,
> > > Manimekalai K
> > > --
> > > *Regards,*
> > > *Manimekalai K*
> >
