[ 
https://issues.apache.org/jira/browse/HBASE-25627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295674#comment-17295674
 ] 

Sandeep Pal commented on HBASE-25627:
-------------------------------------

Following up on the discussion 
[here|https://github.com/apache/hbase/pull/3009#discussion_r586815267], 
[~bharathv]'s suggestion is to have a metric at the source-initialization level 
instead, which will ultimately also surface ZK peer connection issues. 

We can keep track of the number of sources that are being initialized; if they 
get stuck there, monitoring can be set up on that metric.
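
To make the idea concrete, here is a minimal sketch of such a gauge. It is not the actual patch: the names SourceInitializingGauge, SourceInitExample and PeerLookup are hypothetical, and the retry loop is capped only so the example terminates (a real source would keep retrying with backoff). The gauge is raised when initialization starts and lowered only once the peer UUID has been read, so a value that stays above zero points at a source stuck before its shipper starts.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical gauge: number of replication sources currently initializing. */
final class SourceInitializingGauge {
    private final AtomicInteger initializing = new AtomicInteger();

    /** Called when a source begins initialization. */
    void incr() { initializing.incrementAndGet(); }

    /** Called once the source has finished initializing. */
    void decr() { initializing.decrementAndGet(); }

    int value() { return initializing.get(); }
}

final class SourceInitExample {
    /** Hypothetical stand-in for the call that reads the peer cluster id from the peer's ZK. */
    interface PeerLookup {
        String getPeerUUID() throws Exception;
    }

    /**
     * Tries to initialize a source. The gauge is lowered only on success,
     * so it stays elevated for as long as the peer's ZK cannot be reached.
     */
    static boolean initialize(SourceInitializingGauge gauge, PeerLookup peer, int maxAttempts) {
        gauge.incr();
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (peer.getPeerUUID() != null) {
                    gauge.decr();
                    return true; // the shipper could be started from here
                }
            } catch (Exception e) {
                // For example AuthFailed from the peer's ZK: fall through and retry.
            }
        }
        return false; // gauge intentionally left elevated: the source is stuck
    }
}
```

An alert could then fire when the gauge stays non-zero for longer than a normal initialization takes.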

> HBase replication should have a metric to represent if it cannot talk to 
> peer's zk
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-25627
>                 URL: https://issues.apache.org/jira/browse/HBASE-25627
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Sandeep Pal
>            Assignee: Sandeep Pal
>            Priority: Major
>
> There can be situations where the cluster is not able to talk to the peer 
> cluster's ZK. In that case the logQueue will keep accumulating, but without 
> digging into the logs we cannot tell why the logQueue is accumulating on the 
> source. 
> Since the replication source doesn't even start the shipper in this case, it 
> would be good to have a dedicated metric for when the RS cannot talk to the 
> peer's ZK at all. 
>  
> {code:java}
> 2021-03-03 04:02:10,704 DEBUG [peerId] zookeeper.RecoverableZooKeeper - Possibly transient ZooKeeper, quorum=zookeeper-0.zookeeper-a.fakeAddress:2181,zookeeper-1.zookeeper-a.fakeAddress:2181,zookeeper-2.zookeeper-a.fakeAddress:2181,zookeeper-3.zookeeper-a.fakeAddress:2181,zookeeper-4.zookeeper-a.fakeAddress:2181, exception=org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase/hbaseid
> 2021-03-03 04:02:10,704 DEBUG [peerId] zookeeper.RecoverableZooKeeper - Possibly transient ZooKeeper, quorum=zookeeper-0.zookeeper-a.fakeAddress:2181,zookeeper-1.zookeeper-a.fakeAddress:2181,zookeeper-2.zookeeper-a.fakeAddress:2181,zookeeper-3.zookeeper-a.fakeAddress:2181,zookeeper-4.zookeeper-a.fakeAddress:2181, exception=org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase/hbaseid
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed for /hbase/hbaseid
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>     at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1119)
>     at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:284)
>     at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:469)
>     at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
>     at org.apache.hadoop.hbase.zookeeper.ZKClusterId.getUUIDForCluster(ZKClusterId.java:96)
>     at org.apache.hadoop.hbase.replication.HBaseReplicationEndpoint.getPeerUUID(HBaseReplicationEndpoint.java:104)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:306)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
