[ 
https://issues.apache.org/jira/browse/HBASE-21505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695837#comment-16695837
 ] 

Wellington Chevreuil commented on HBASE-21505:
----------------------------------------------

Thanks for the comments [~tianjingyun]! 
{quote}these metrics are mainly to indicate if current replication source is 
stuck.That's why we need to update the TimeStampsOfLastShippedOp even there is 
no log replicated
{quote}
Yeah, my proposal here is to make these metrics consistent both on the 
behaviour and name accuracy, currently for example, both 
TimeStampsOfLastShippedOp and AgeOfLastShippedOp are misleading, and it also 
seems to have some bugs on its calculation (see #4 from original description).
{quote}And AgeOfLastShippedOp = current time stamp - timeStampOfLastShippedOp
{quote}
I understand that, and I think this is one the things that makes these metrics 
inconsistent, as sometimes it may not really be the age of last shipped if we 
keep updating *timeStampOfLastShippedOp* even when no edits came in. This will 
cause this metric to sometimes report one thing (*timeStampOfLastShippedOp* was 
really updated with a new edit), other times it will report nothing (when 
*timeStampOfLastShippedOp* just gets updated because of another cycle of the 
source run, but no edits came in).

And things get even worse when recovery queues are in the picture, but we don't 
show those. Have seen lots of confusion from users, and was thinking in 
refining the command, together with new metrics, to give a more detailed and 
intuitive output about replication status.
{quote}Are you planning add more metrics to help calculate replication lag? My 
concern is we need to make sure the replication lag could really indicate the 
status of replication is normal or not.
{quote}
Yes, per my previous comments, idea is to provide a more intuitive "feedback" 
about replication source state under different scenarios. The main metric to 
identify stuck replications would still be *Replication Lag*. A new metric to 
track when a last edit came into source would be added, this would be 
*TimeStampOfLastArrivedInSource*, per my previous comment suggestions.  
*TimeStampOfLastArrivedInSource* would be used to calculate the lag, so that we 
don't need to keep updating *timeStampOfLastShippedOp* (even when no edit was 
really shipped) anymore. It would allow us to consistently report the actual 
time an edit was last shipped to target. *AgeOfLastShippedOp* would also be 
more consistent, as it would show how much time last edit took between when it 
got into source and when it was shipped to target. These would also give 
additional facts on how replication is performing, for example, in bullet #*e* 
from my previous comment, we show there's no lag, and last shipped spent 30 
secs to complete replication. Also, I think it avoids confusion to explicit say 
if no edits were ever shipped or arrived in source since last source RS restart 
than reporting an arbitrary date (in this case, RS restart date/replication 
source last run date).

I'm still working on this, currently need to perform more tests, but can share 
an initial patch for review.

> Several inconsistencies on information reported for Replication Sources by 
> hbase shell status 'replication' command.
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21505
>                 URL: https://issues.apache.org/jira/browse/HBASE-21505
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>
> While reviewing hbase shell status 'replication' command, noticed the 
> following issues related to replication source section:
> 1) TimeStampsOfLastShippedOp keeps getting updated and increasing even when 
> no new edits were added to source, so nothing was really shipped. Test steps 
> performed:
> 1.1) Source cluster with only one table targeted to replication;
> 1.2) Added a new row, confirmed the row appeared in Target cluster;
> 1.3) Issued status 'replication' command in source, TimeStampsOfLastShippedOp 
> shows current timestamp T1.
> 1.4) Waited 30 seconds, no new data added to source. Issued status 
> 'replication' command, now shows timestamp T2.
> 2) When replication is stuck due some connectivity issues or target 
> unavailability, if new edits are added in source, reported AgeOfLastShippedOp 
> is wrongly showing same value as "Replication Lag". This is incorrect, 
> AgeOfLastShippedOp should not change until there's indeed another edit 
> shipped to target. Test steps performed:
> 2.1) Source cluster with only one table targeted to replication;
> 2.2) Stopped target cluster RS;
> 2.3) Put a new row on source. Running status 'replication' command does show 
> lag increasing. TimeStampsOfLastShippedOp seems correct also, no further 
> updates as described on bullet #1 above.
> 2.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even 
> though there's no new edit shipped to target:
> {noformat}
> ...
>  SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
> ...
> ...
> SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
> ...
> {noformat}
> 3) AgeOfLastShippedOp gets set to 0 even when a given edit had taken some 
> time before it got finally shipped to target. Test steps performed:
> 3.1) Source cluster with only one table targeted to replication;
> 3.2) Stopped target cluster RS;
> 3.3) Put a new row on source. 
> 3.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even 
> though there's no new edit shipped to target:
> {noformat}
> T1:
> ...
>  SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
> ...
> T2:
> ...
> SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
> ...
> {noformat}
> 3.5) Restart target cluster RS and verified the new row appeared there. No 
> new edit added, but status 'replication' command reports AgeOfLastShippedOp 
> as 0, while it should be the diff between the time it concluded shipping at 
> target and the time it was added in source:
> {noformat}
> SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=0
> {noformat}
> 4) When replication is stuck due some connectivity issues or target 
> unavailability, if RS is restarted, once recovered queue source is started, 
> TimeStampsOfLastShippedOp is set to initial java date (Thu Jan 01 01:00:00 
> GMT 1970, for example), thus "Replication Lag" also gives a complete 
> inaccurate value. 
> Tests performed:
> 4.1) Source cluster with only one table targeted to replication;
> 4.2) Stopped target cluster RS;
> 4.3) Put a new row on source, restart RS on source, waited a few seconds for 
> recovery queue source to startup, then it gives:
> {noformat}
> SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, 
> TimeStampsOfLastShippedOp=Thu Jan 01 01:00:00 GMT 1970, Replication 
> Lag=9223372036854775807
> {noformat}
> Also, we should report status to all sources running, current output format 
> gives the impression there’s only one, even when there are recovery queues, 
> for instance. 
> Here is a list of ideas on how the command should report under different 
> states of replication:
> a) Source started, target stopped, no edits arrived on source yet: 
> Status replication should not show any lags, no edits shipped, no edits 
> arrived;
> b) Source started, target stopped, add edit on source:
> Status replication should report following info -> lag, time of edit arrival 
> on source, additional message saying no edits had been shipped to target;
> c) Source started, target stopped, edit added on source, restart source:
> Status replication should list two sources, one normal, other recovered. 
> Normal source should show no lags, no edits shipped, no edits arrived. 
> Recovered should show no edits shipped, but should have edits arrived in 
> source and lag > 0;
> d) Source started, target stopped, add edit on source, restart source, add 
> another edit on source:
> Status replication should list two sources, one normal, other recovered. Both 
> sources should show no edits shipped, but should have edits arrived in source 
> and lag > 0;
> e) Source started, target stopped, add edit on source, restart source, add 
> another edit on source, start target:
> Status replication should list normal source only (after some short period), 
> with proper times for last shipped, last arrived in source and no replication 
> lag.
> f) Source started, target stopped, add edit on source, restart source, 
> restart target:
> Status replication should list normal source only, with no shipped, nor 
> arrived edits, and lag should be 0;



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to