bharathv commented on a change in pull request #3009:
URL: https://github.com/apache/hbase/pull/3009#discussion_r586812579
##########
File path:
hbase-hadoop2-compat/src/main/java/org/apache/hadoop/hbase/replication/regionserver/MetricsReplicationGlobalSourceSource.java
##########
@@ -197,6 +205,20 @@ public void incrFailedRecoveryQueue() {
failedRecoveryQueue.incr(1L);
}
+ @Override
+ public void setPeerZkConnectionFailures(boolean success) {
Review comment:
This logic in a global source doesn't give the intended result here
because `peerZkConnectionFailures` is reset to 0 if _any_ source succeeds. That
doesn't seem right.
##########
File path:
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
##########
@@ -293,9 +293,12 @@ public void run() {
while (this.isSourceActive() && this.peerClusterId == null) {
this.peerClusterId = replicationEndpoint.getPeerUUID();
if (this.isSourceActive() && this.peerClusterId == null) {
+ metrics.setPeerZkConnectionFailures(false);
Review comment:
Had a brief discussion offline with @sandeepvinayak on how to generalize
this. I think the intent here is capture and flag any sources that are stuck
during initialization and ZK connection failure is just a symptom. So capturing
those ZK failure count may not add much value, instead we track _number of
uninitialized sources_, (gauge) that'd be much more helpful.
So one way forward is to track number of such uninitialized sources at a
global scope (that the monitoring tooling can flag if its > 0 for a time
window) and then backport https://issues.apache.org/jira/browse/HBASE-22731 to
branch-1. These two together should help us narrow down it to the right RS and
root cause.
Thoughts?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]