[jira] [Commented] (HBASE-21505) Several inconsistencies on information reported for Replication Sources by hbase shell status 'replication' command.

Viraj Jasani (Jira) Mon, 30 May 2022 16:43:05 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-21505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544085#comment-17544085
 ]


Viraj Jasani commented on HBASE-21505:
--------------------------------------

For ReplicationSourceWALReader#addEntryToBatch, empty Edit entry log was 
changed from TRACE/DEBUG as part of this Jira to improve the visibility 
apparently:
{code:java}
// returns true if we reach the size limit for batch, i.e, we need to finish 
the batch and return.
protected final boolean addEntryToBatch(WALEntryBatch batch, Entry entry) {
  WALEdit edit = entry.getEdit();
  if (edit == null || edit.isEmpty()) {
    LOG.debug("Edit null or empty for entry {} ", entry);  <===  trace to debug
    return false;
  }
  LOG.debug("updating TimeStampOfLastAttempted to {}, from entry {}, for source 
queue: {}",
    entry.getKey().getWriteTime(), entry.getKey(), this.source.getQueueId());
...
...
... {code}
 

One of the clusters where I am running heavy workload, I keep seeing these logs 
with very high frequency:
{code:java}
2022-05-30 23:28:41,486 DEBUG [20%2C1653951316540,1] 
regionserver.ReplicationSourceWALReader - Filtered entry for replication: 
HBASE.T1/1da27f763d577cfee6f2a04f2e8832c4/3009414=[#edits: 0 = <>]
2022-05-30 23:28:41,486 DEBUG [20%2C1653951316540,1] 
regionserver.ReplicationSourceWALReader - Edit null or empty for entry 
HBASE.T1/1da27f763d577cfee6f2a04f2e8832c4/3009414=[#edits: 0 = <>] 
2022-05-30 23:28:41,486 DEBUG [20%2C1653951316540,1] 
regionserver.ReplicationSourceWALReader - Filtered entry for replication: 
HBASE.T1/e3bee00021060fe7df906c4fda5b593e/2873932=[#edits: 0 = <>]
2022-05-30 23:28:41,486 DEBUG [20%2C1653951316540,1] 
regionserver.ReplicationSourceWALReader - Edit null or empty for entry 
HBASE.T1/e3bee00021060fe7df906c4fda5b593e/2873932=[#edits: 0 = <>]  {code}
[~wchevreuil] To avoid some noise, do you think it might be good idea to keep 
both of these logs at TRACE level given that they are for empty/null entry only?

> Several inconsistencies on information reported for Replication Sources by 
> hbase shell status 'replication' command.
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21505
>                 URL: https://issues.apache.org/jira/browse/HBASE-21505
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 3.0.0-alpha-1, 2.2.0
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.3.0
>
>         Attachments: 
> 0001-HBASE-21505-initial-version-for-more-detailed-report.patch, 
> HBASE-21505-branch-2.001.patch, HBASE-21505-branch-2.001.patch, 
> HBASE-21505-branch-2.002.patch, HBASE-21505-branch-2.002.patch, 
> HBASE-21505-master.001.patch, HBASE-21505-master.002.patch, 
> HBASE-21505-master.003.patch, HBASE-21505-master.004.patch, 
> HBASE-21505-master.005.patch, HBASE-21505-master.006.patch, 
> HBASE-21505-master.007.patch, HBASE-21505-master.008.patch, 
> HBASE-21505-master.009.patch, HBASE-21505-master.010.patch, 
> HBASE-21505-master.011.patch, HBASE-21505-master.011.patch
>
>
> While reviewing hbase shell status 'replication' command, noticed the 
> following issues related to replication source section:
> 1) TimeStampsOfLastShippedOp keeps getting updated and increasing even when 
> no new edits were added to source, so nothing was really shipped. Test steps 
> performed:
> 1.1) Source cluster with only one table targeted to replication;
> 1.2) Added a new row, confirmed the row appeared in Target cluster;
> 1.3) Issued status 'replication' command in source, TimeStampsOfLastShippedOp 
> shows current timestamp T1.
> 1.4) Waited 30 seconds, no new data added to source. Issued status 
> 'replication' command, now shows timestamp T2.
> 2) When replication is stuck due some connectivity issues or target 
> unavailability, if new edits are added in source, reported AgeOfLastShippedOp 
> is wrongly showing same value as "Replication Lag". This is incorrect, 
> AgeOfLastShippedOp should not change until there's indeed another edit 
> shipped to target. Test steps performed:
> 2.1) Source cluster with only one table targeted to replication;
> 2.2) Stopped target cluster RS;
> 2.3) Put a new row on source. Running status 'replication' command does show 
> lag increasing. TimeStampsOfLastShippedOp seems correct also, no further 
> updates as described on bullet #1 above.
> 2.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even 
> though there's no new edit shipped to target:
> {noformat}
> ...
>  SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
> ...
> ...
> SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
> ...
> {noformat}
> 3) AgeOfLastShippedOp gets set to 0 even when a given edit had taken some 
> time before it got finally shipped to target. Test steps performed:
> 3.1) Source cluster with only one table targeted to replication;
> 3.2) Stopped target cluster RS;
> 3.3) Put a new row on source. 
> 3.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even 
> though there's no new edit shipped to target:
> {noformat}
> T1:
> ...
>  SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
> ...
> T2:
> ...
> SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
> ...
> {noformat}
> 3.5) Restart target cluster RS and verified the new row appeared there. No 
> new edit added, but status 'replication' command reports AgeOfLastShippedOp 
> as 0, while it should be the diff between the time it concluded shipping at 
> target and the time it was added in source:
> {noformat}
> SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=0
> {noformat}
> 4) When replication is stuck due some connectivity issues or target 
> unavailability, if RS is restarted, once recovered queue source is started, 
> TimeStampsOfLastShippedOp is set to initial java date (Thu Jan 01 01:00:00 
> GMT 1970, for example), thus "Replication Lag" also gives a complete 
> inaccurate value. 
> Tests performed:
> 4.1) Source cluster with only one table targeted to replication;
> 4.2) Stopped target cluster RS;
> 4.3) Put a new row on source, restart RS on source, waited a few seconds for 
> recovery queue source to startup, then it gives:
> {noformat}
> SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Thu Jan 01 01:00:00 GMT 1970, Replication 
> Lag=9223372036854775807
> {noformat}
> Also, we should report status to all sources running, current output format 
> gives the impression there’s only one, even when there are recovery queues, 
> for instance. 
> Here is a list of ideas on how the command should report under different 
> states of replication:
> a) Source started, target stopped, no edits arrived on source yet: 
> Status replication should not show any lags, no edits shipped, no edits 
> arrived;
> b) Source started, target stopped, add edit on source:
> Status replication should report following info -> lag, time of edit arrival 
> on source, additional message saying no edits had been shipped to target;
> c) Source started, target stopped, edit added on source, restart source:
> Status replication should list two sources, one normal, other recovered. 
> Normal source should show no lags, no edits shipped, no edits arrived. 
> Recovered should show no edits shipped, but should have edits arrived in 
> source and lag > 0;
> d) Source started, target stopped, add edit on source, restart source, add 
> another edit on source:
> Status replication should list two sources, one normal, other recovered. Both 
> sources should show no edits shipped, but should have edits arrived in source 
> and lag > 0;
> e) Source started, target stopped, add edit on source, restart source, add 
> another edit on source, start target:
> Status replication should list normal source only (after some short period), 
> with proper times for last shipped, last arrived in source and no replication 
> lag.
> f) Source started, target stopped, add edit on source, restart source, 
> restart target:
> Status replication should list normal source only, with no shipped, nor 
> arrived edits, and lag should be 0;



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Commented] (HBASE-21505) Several inconsistencies on information reported for Replication Sources by hbase shell status 'replication' command.

Reply via email to