Sandeep Pal created HBASE-25741:
-----------------------------------
Summary: Replication Source still having the replication metrics
for peer ID which doesn't exist.
Key: HBASE-25741
URL: https://issues.apache.org/jira/browse/HBASE-25741
Project: HBase
Issue Type: Bug
Affects Versions: 1.8.0
Reporter: Sandeep Pal
Assignee: Sandeep Pal
We have observed that replication source metrics for a peer still exist on some
region servers even though the peer has been removed. When ReplicationSource
encounters a NoNodeException, it calls the `peerRemoved` workflow, which should
eventually terminate the source and remove it from the source manager. The
problem is that the ReplicationSource thread terminates itself, so the
removePeer action never completes and the source's metrics are left behind
forever. This is the flow: the replication source tries to clean WALs
[here|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L801]
and on NoNodeException it calls
[peerRemoved|https://github.com/apache/hbase/blob/b231dd620f107b488b88599e16dc846eb856972c/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java#L244]
and terminates the source (itself), leaving the terminated source in the
source manager without clearing its
[metrics|https://github.com/apache/hbase/blob/b231dd620f107b488b88599e16dc846eb856972c/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceManager.java#L645].
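
To illustrate the failure mode described above, here is a minimal sketch in
plain Java. It is not the HBase code: `PeerRemovalSketch`, `Manager`, `Source`
and `metricsLeak` are hypothetical names, and the interrupt-then-join inside
`terminate()` is only an assumption about one way a thread that terminates
itself can abort the rest of the removal workflow.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, hypothetical model of the flow; none of this is HBase code.
final class PeerRemovalSketch {

    /** Stand-in for the source manager: tracks sources and their metrics. */
    static final class Manager {
        final Map<String, Source> sources = new ConcurrentHashMap<>();
        final Set<String> metrics = ConcurrentHashMap.newKeySet();

        void register(Source source) {
            sources.put(source.peerId, source);
            metrics.add(source.peerId);
        }

        /** Three-step removal: terminate the source, forget it, clear metrics. */
        void peerRemoved(String peerId) throws InterruptedException {
            Source source = sources.get(peerId);
            if (source == null) {
                return;
            }
            source.terminate();      // step 1: throws if the caller is the source
            sources.remove(peerId);  // step 2: skipped when step 1 throws
            metrics.remove(peerId);  // step 3: skipped when step 1 throws
        }
    }

    /** Stand-in for a replication source thread. */
    static final class Source extends Thread {
        final String peerId;
        final Manager manager;
        final boolean hitsNoNode;  // simulates the NoNodeException path

        Source(String peerId, Manager manager, boolean hitsNoNode) {
            this.peerId = peerId;
            this.manager = manager;
            this.hitsNoNode = hitsNoNode;
        }

        @Override
        public void run() {
            try {
                if (hitsNoNode) {
                    // The source itself starts the removal workflow.
                    manager.peerRemoved(peerId);
                } else {
                    // Idle until another thread terminates this source.
                    Thread.sleep(Long.MAX_VALUE);
                }
            } catch (InterruptedException e) {
                // The thread exits here; pending removal steps never run.
            }
        }

        void terminate() throws InterruptedException {
            interrupt();
            // Wait for the thread to exit. If the caller is this thread, its
            // interrupt flag is already set, so join throws immediately.
            join(1000);
        }
    }

    /** Returns true if metrics for the peer remain after the removal attempt. */
    static boolean metricsLeak(boolean selfRemoval) {
        Manager manager = new Manager();
        Source source = new Source("peer1", manager, selfRemoval);
        manager.register(source);
        source.start();
        try {
            if (!selfRemoval) {
                manager.peerRemoved("peer1");  // removal driven by another thread
            }
            source.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return manager.metrics.contains("peer1");
    }

    public static void main(String[] args) {
        System.out.println("removed by another thread, leak: " + metricsLeak(false));
        System.out.println("removed by the source itself, leak: " + metricsLeak(true));
    }
}
```

In this model the metrics are cleared when another thread drives the removal,
and they remain when the source triggers the removal from its own thread. A fix
along these lines would need the cleanup steps to run even when the caller is
the source being terminated.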