[ 
https://issues.apache.org/jira/browse/HBASE-21505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16699375#comment-16699375
 ] 

Wellington Chevreuil commented on HBASE-21505:
----------------------------------------------

Thanks for the comments [~tianjingyun]. Please see my replies below:

{quote}How do you calculate replication lag when there is no entry need to 
replicate for this peer? I saw you remove the code that update 
TimeStampOfLastShippedOp{quote}
*TimeStampOfLastShippedOp* is still used for calculating replication lag, and 
it's still updated only in ReplicationSourceShipper, once it has successfully 
shipped the given edit. I added TimeStampOfLastAttempted to track the 
timestamp of the latest entry added to the source that should be replicated, 
so that I can compare it with shipment times and calculate the lag. As long as 
the shipment time is newer than the time of the latest entry added to the 
source, there's no lag. If the arrival time is newer, then the lag is defined 
as: (current time - arrival time).
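
To make the rule above concrete, here is a minimal sketch of the comparison. It is not the code from the attached patch: the class name, method name and parameter names are hypothetical, timestamps are assumed to be epoch milliseconds, and treating equal timestamps as "no lag" is an assumption, since the description only covers the strictly newer cases.

```java
// Hypothetical helper illustrating the lag rule described above.
// Not taken from the HBASE-21505 patch; all names are illustrative.
final class ReplicationLagSketch {
  private ReplicationLagSketch() {}

  /**
   * @param lastAddedToSourceTs time the newest entry was queued on the source
   * @param lastShippedTs       time of the last successful shipment
   * @param now                 current time
   * @return 0 if the last shipment is at least as recent as the last arrival,
   *         otherwise (current time - arrival time)
   */
  public static long replicationLag(long lastAddedToSourceTs, long lastShippedTs, long now) {
    // Shipment is newer than (or, by assumption, equal to) the last arrival:
    // everything queued so far has been shipped, so there is no lag.
    if (lastShippedTs >= lastAddedToSourceTs) {
      return 0L;
    }
    // An entry arrived after the last shipment and is still waiting.
    return now - lastAddedToSourceTs;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    // Entry added 5s ago, shipped 1s ago: prints 0
    System.out.println(replicationLag(now - 5_000, now - 1_000, now));
    // Entry added 5s ago, last shipment 10s ago: prints 5000
    System.out.println(replicationLag(now - 5_000, now - 10_000, now));
  }
}
```

With this rule the lag keeps growing only while an entry newer than the last shipment is waiting in the source, and it is computed without touching TimeStampOfLastShippedOp.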

{quote}getTimeStampOfLastAttemtped what is this for? I didn't see any place 
that update this metric.{quote}
As explained above, it is used to track the time at which the last edit 
targeted for replication was inserted in the source. I should rename it to 
something more intuitive, such as TimeStampOfLastAddedToSource. It's updated in 
ReplicationSourceWALReader.addEntryToBatch whenever a new entry is read and 
placed in the queue to be consumed by the Shipper. It may be kept internal 
only, though. Not sure if it's worth exposing it on ;

{quote}same question for connectedToPeer{quote}
In the event that the peer ZK is not available, ReplicationSource 
initialization can get stuck trying to connect to ZK before it even starts 
trying to read/ship edits, so I thought about adding this metric to give more 
insight into what might be wrong with replication. Since the workers have not 
really started, I thought it wouldn't make much sense to show zero-valued 
stats, but rather to explicitly mention the initialization issue.

Yeah, the attached patch was not yet intended for commit. I still need to 
polish the variable names, decide what to actually expose in the hbase shell 
command, and implement tests for the different conditions. I'm planning to add 
an initial patch version addressing that by tomorrow EOD. Let me know of any 
concerns you might have about what we have discussed so far.
 

 


> Several inconsistencies on information reported for Replication Sources by 
> hbase shell status 'replication' command.
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21505
>                 URL: https://issues.apache.org/jira/browse/HBASE-21505
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>         Attachments: 
> 0001-HBASE-21505-initial-version-for-more-detailed-report.patch
>
>
> While reviewing hbase shell status 'replication' command, noticed the 
> following issues related to replication source section:
> 1) TimeStampsOfLastShippedOp keeps getting updated and increasing even when 
> no new edits were added to source, so nothing was really shipped. Test steps 
> performed:
> 1.1) Source cluster with only one table targeted to replication;
> 1.2) Added a new row, confirmed the row appeared in Target cluster;
> 1.3) Issued status 'replication' command in source, TimeStampsOfLastShippedOp 
> shows current timestamp T1.
> 1.4) Waited 30 seconds, no new data added to source. Issued status 
> 'replication' command, now shows timestamp T2.
> 2) When replication is stuck due to some connectivity issues or target 
> unavailability, if new edits are added in source, reported AgeOfLastShippedOp 
> is wrongly showing same value as "Replication Lag". This is incorrect, 
> AgeOfLastShippedOp should not change until there's indeed another edit 
> shipped to target. Test steps performed:
> 2.1) Source cluster with only one table targeted to replication;
> 2.2) Stopped target cluster RS;
> 2.3) Put a new row on source. Running status 'replication' command does show 
> lag increasing. TimeStampsOfLastShippedOp seems correct also, no further 
> updates as described on bullet #1 above.
> 2.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even 
> though there's no new edit shipped to target:
> {noformat}
> ...
>  SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
> ...
> ...
> SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
> ...
> {noformat}
> 3) AgeOfLastShippedOp gets set to 0 even when a given edit had taken some 
> time before it got finally shipped to target. Test steps performed:
> 3.1) Source cluster with only one table targeted to replication;
> 3.2) Stopped target cluster RS;
> 3.3) Put a new row on source. 
> 3.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even 
> though there's no new edit shipped to target:
> {noformat}
> T1:
> ...
>  SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
> ...
> T2:
> ...
> SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
> ...
> {noformat}
> 3.5) Restart target cluster RS and verified the new row appeared there. No 
> new edit added, but status 'replication' command reports AgeOfLastShippedOp 
> as 0, while it should be the diff between the time it concluded shipping at 
> target and the time it was added in source:
> {noformat}
> SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=0
> {noformat}
> 4) When replication is stuck due to some connectivity issues or target 
> unavailability, if RS is restarted, once recovered queue source is started, 
> TimeStampsOfLastShippedOp is set to initial java date (Thu Jan 01 01:00:00 
> GMT 1970, for example), thus "Replication Lag" also gives a completely 
> inaccurate value. 
> Tests performed:
> 4.1) Source cluster with only one table targeted to replication;
> 4.2) Stopped target cluster RS;
> 4.3) Put a new row on source, restart RS on source, waited a few seconds for 
> recovery queue source to startup, then it gives:
> {noformat}
> SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Thu Jan 01 01:00:00 GMT 1970, Replication 
> Lag=9223372036854775807
> {noformat}
> Also, we should report status for all running sources; the current output 
> format gives the impression there’s only one, even when there are recovery 
> queues, for instance. 
> Here is a list of ideas on how the command should report under different 
> states of replication:
> a) Source started, target stopped, no edits arrived on source yet: 
> Status replication should not show any lags, no edits shipped, no edits 
> arrived;
> b) Source started, target stopped, add edit on source:
> Status replication should report following info -> lag, time of edit arrival 
> on source, additional message saying no edits had been shipped to target;
> c) Source started, target stopped, edit added on source, restart source:
> Status replication should list two sources, one normal, other recovered. 
> Normal source should show no lags, no edits shipped, no edits arrived. 
> Recovered should show no edits shipped, but should have edits arrived in 
> source and lag > 0;
> d) Source started, target stopped, add edit on source, restart source, add 
> another edit on source:
> Status replication should list two sources, one normal, other recovered. Both 
> sources should show no edits shipped, but should have edits arrived in source 
> and lag > 0;
> e) Source started, target stopped, add edit on source, restart source, add 
> another edit on source, start target:
> Status replication should list normal source only (after some short period), 
> with proper times for last shipped, last arrived in source and no replication 
> lag.
> f) Source started, target stopped, add edit on source, restart source, 
> restart target:
> Status replication should list normal source only, with no shipped, nor 
> arrived edits, and lag should be 0;



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
