Nick Dimiduk created HBASE-27707:
------------------------------------
Summary: Region replica replication sometimes orphans WAL queue
entries during recovery
Key: HBASE-27707
URL: https://issues.apache.org/jira/browse/HBASE-27707
Project: HBase
Issue Type: Bug
Components: read replicas, Replication
Affects Versions: 2.5.0
Reporter: Nick Dimiduk
Running with timeline-consistent read replicas and
{{hbase.region.replica.replication.enabled=true}}, we're seeing some region
servers hold WAL queue entries that never clear. This appears to correlate with
ServerCrashProcedure (SCP) and recovery of replication queues. The result is WALs that build up,
consuming dangerous amounts of space on HDFS. Remediation requires disabling
and removing the {{region_replica_replication}} peer, which forces an impacted
region server to abort with the message "Failed to operate on replication
queue". We then delete the ZooKeeper entry, which unlocks the WALs so that the
cleaner chore can sweep them.
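
To find candidate entries before deleting anything, a small triage helper along these lines may be useful. This is only a sketch, not part of HBase: it assumes the queue IDs have already been listed from the affected region server's replication znode, and that a recovered (claimed) queue is named {{<peerId>-<deadServer>...}} while a server's own queue is named just {{<peerId>}}. The function names and the sample server names are hypothetical.

```python
from typing import Iterable, List, Tuple

# Peer name taken from the report above.
REGION_REPLICA_PEER = "region_replica_replication"


def split_queue_id(queue_id: str) -> Tuple[str, str]:
    """Split a queue ID into (peer_id, remainder).

    Assumption: peer IDs contain no hyphen, so everything before the first
    '-' is the peer ID and anything after it names the dead server(s) the
    queue was recovered from. The remainder is kept as a raw string because
    host names may themselves contain hyphens.
    """
    peer_id, _, remainder = queue_id.partition("-")
    return peer_id, remainder


def recovered_queues_for_peer(queue_ids: Iterable[str], peer_id: str) -> List[str]:
    """Return the queue IDs that belong to peer_id and were recovered from another server."""
    matches = []
    for queue_id in queue_ids:
        found_peer, remainder = split_queue_id(queue_id)
        if found_peer == peer_id and remainder:
            matches.append(queue_id)
    return matches


if __name__ == "__main__":
    # Made-up listing for illustration only.
    listing = [
        "region_replica_replication",
        "region_replica_replication-rs-1.example.com,16020,1678000000000",
        "1",
        "1-rs-2.example.com,16020,1678000000001",
    ]
    for queue_id in recovered_queues_for_peer(listing, REGION_REPLICA_PEER):
        print(queue_id)
```

The helper only reports candidates; whether a listed queue is actually orphaned still has to be confirmed by checking that its WAL list is not draining.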