Nick Dimiduk created HBASE-27707:
------------------------------------

             Summary: Region replica replication sometimes orphans WAL queue entries during recovery
                 Key: HBASE-27707
                 URL: https://issues.apache.org/jira/browse/HBASE-27707
             Project: HBase
          Issue Type: Bug
          Components: read replicas, Replication
    Affects Versions: 2.5.0
            Reporter: Nick Dimiduk


Running with timeline-consistent read replicas and 
{{hbase.region.replica.replication.enabled=true}}, we're seeing some region 
servers have WAL queue entries that never clear. This appears to correlate with 
ServerCrashProcedure (SCP) and the recovery of replication queues. The result is 
WALs that accumulate, consuming dangerous amounts of space on HDFS. Remediation 
requires disabling 
and removing the {{region_replica_replication}} peer, which forces an impacted 
region server to abort with the message "Failed to operate on replication 
queue". We then delete the ZooKeeper entry, which unlocks the WALs so that the 
cleaner chore can sweep them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
