[
https://issues.apache.org/jira/browse/HBASE-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lars Hofhansl updated HBASE-11143:
----------------------------------
Attachment: 11143-0.94.txt
Simple 0.94 patch. Sets the metric to current time when there's nothing to
replicate (and it's not due to an error).
When we're hanging somewhere because the slave cluster is down or replication
takes a very long time, the metric is still incremented, though. I think that's
OK: there might be a delay in that case, we just do not know how large it is.
It's also nice as we can use large values of this metric as an indicator that
something is wrong.
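
To make the intended behavior concrete, here is a rough sketch of the three
cases. It is not the attached patch: the class ReplicationLagSketch and all of
its method names are made up for illustration and are not HBase APIs. It only
assumes that the age is computed as "now minus a stored timestamp".

```java
// Hypothetical illustration only; none of these names exist in HBase.
final class ReplicationLagSketch {
    // Timestamp (ms) that the reported age is measured against.
    private long lastShippedTimestamp;

    ReplicationLagSketch(long startTimeMs) {
        this.lastShippedTimestamp = startTimeMs;
    }

    // Case 1: an edit written at editWriteTimeMs was shipped successfully,
    // so the age becomes the age of that edit.
    void onEditShipped(long editWriteTimeMs) {
        lastShippedTimestamp = editWriteTimeMs;
    }

    // Case 2: nothing to replicate and no error occurred, so the timestamp
    // is reset to the current time and the reported age drops to about 0.
    void onNothingToReplicate(long nowMs) {
        lastShippedTimestamp = nowMs;
    }

    // Case 3: shipping failed or is hanging. The timestamp is deliberately
    // left alone, so the reported age keeps growing.
    void onShipFailedOrStalled() {
        // intentionally empty
    }

    // Age as it would be reported at time nowMs.
    long ageAt(long nowMs) {
        return nowMs - lastShippedTimestamp;
    }
}
```

With this shape, an idle RegionServer reports an age near 0, while a stalled
one reports an ever larger value, which is what makes large values usable as
an alert signal.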
> ageOfLastShippedOp confusing
> ----------------------------
>
> Key: HBASE-11143
> URL: https://issues.apache.org/jira/browse/HBASE-11143
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Reporter: Lars Hofhansl
> Fix For: 0.94.20
>
> Attachments: 11143-0.94.txt
>
>
> We are trying to report on replication lag and find that there is no good
> single metric to do that.
> ageOfLastShippedOp is close, but unfortunately it is increased even when
> there is nothing to ship on a particular RegionServer.
> I would like to discuss a few options here:
> Add a new metric: replicationQueueTime (or something) with the above meaning.
> I.e. if we have something to ship we set the age of that last shipped edit,
> if we fail we increment that last time (just like we do now). But if there is
> nothing to replicate we set it to current time (and hence that metric is
> reported as close to 0).
> Alternatively we could change the meaning of ageOfLastShippedOp to do
> that. That might lead to surprises, but the current behavior is clearly weird
> when there is nothing to replicate.
> Comments? [~jdcryans], [~stack].
> If the approach sounds good, I'll make a patch for all branches.