[GitHub] [hbase] bharathv commented on a change in pull request #3009: HBASE-25627: HBase replication should have a metric to represent if it cannot talk to peer's zk

GitBox Wed, 03 Mar 2021 13:46:23 -0800


bharathv commented on a change in pull request #3009:
URL: https://github.com/apache/hbase/pull/3009#discussion_r586812579




##########
File path: 
hbase-hadoop2-compat/src/main/java/org/apache/hadoop/hbase/replication/regionserver/MetricsReplicationGlobalSourceSource.java
##########
@@ -197,6 +205,20 @@ public void incrFailedRecoveryQueue() {
     failedRecoveryQueue.incr(1L);
   }
 
+  @Override
+  public void setPeerZkConnectionFailures(boolean success) {

Review comment:
       This logic in a global source doesn't give the intended result here 
because `peerZkConnectionFailures` is reset to 0 if _any_ source succeeds. That 
doesn't seem right.

##########
File path: 
hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
##########
@@ -293,9 +293,12 @@ public void run() {
     while (this.isSourceActive() && this.peerClusterId == null) {
       this.peerClusterId = replicationEndpoint.getPeerUUID();
       if (this.isSourceActive() && this.peerClusterId == null) {
+        metrics.setPeerZkConnectionFailures(false);

Review comment:
       Had a brief discussion offline with @sandeepvinayak on how to generalize 
this. I think the intent here is capture and flag any sources that are stuck 
during initialization and ZK connection failure is just a symptom. So capturing 
those ZK failure count may not add much value, instead we track _number of 
uninitialized sources_, (gauge) that'd be much more helpful.
   
   So one way forward is to track number of such uninitialized sources at a 
global scope (that the monitoring tooling can flag if its > 0 for a time 
window) and then backport https://issues.apache.org/jira/browse/HBASE-22731 to 
branch-1. These two together should help us narrow down it to the right RS and 
root cause.
   
   Thoughts?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hbase] bharathv commented on a change in pull request #3009: HBASE-25627: HBase replication should have a metric to represent if it cannot talk to peer's zk

Reply via email to