[ https://issues.apache.org/jira/browse/HBASE-27707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704656#comment-17704656 ]
Nick Dimiduk commented on HBASE-27707:
--------------------------------------

{quote}
I suppose the queue should be in memory only? We should not record them on zk? Oh, maybe we still need them to prevent wal cleaner deletes them...
{quote}

Yes, I think we want to keep them in ZK for this reason.

> Region replica replication sometimes orphans WAL queue entries during recovery
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-27707
>                 URL: https://issues.apache.org/jira/browse/HBASE-27707
>             Project: HBase
>          Issue Type: Bug
>          Components: read replicas, Replication
>    Affects Versions: 2.5.0
>            Reporter: Nick Dimiduk
>            Priority: Critical
>
> Running with timeline-consistent read replicas and
> {{hbase.region.replica.replication.enabled=true}}, we're seeing some region
> servers have WAL queue entries that never clear. This appears to correlate
> with SCP and recovery of replication queues. The result is WALs that build
> up, consuming dangerous amounts of space on HDFS. Remediation requires
> disabling and removing the {{region_replica_replication}} peer, which forces
> an impacted region server to abort with the message "Failed to operate on
> replication queue". We then delete the zk entry, which unlocks the WALs so
> that the cleaner chore can sweep them.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)