[jira] [Commented] (HBASE-27707) Region replica replication sometimes orphans WAL queue entries during recovery

Nick Dimiduk (Jira) Fri, 24 Mar 2023 08:06:55 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-27707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704656#comment-17704656
 ]


Nick Dimiduk commented on HBASE-27707:
--------------------------------------

{quote}
I suppose the queue should be in memory only? We should not record them on zk? 
Oh, maybe we still need them to prevent the WAL cleaner from deleting them...
{quote}

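Yes, I think we want to keep them in ZK for this reason: the queue entries are what stop the WAL cleaner from deleting WALs that still need to be replicated.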
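
To make the reasoning concrete, here is a toy model of the interaction, not HBase code. It assumes only what is discussed in this issue: a WAL stays on disk while some replication queue entry references it, and the cleaner removes the rest. The class {{ToyCluster}}, its fields, and the queue and WAL names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy stand-in for the WAL cleaner / replication queue interaction."""

    # Names of WAL files eligible for cleaning (hypothetical).
    old_wals: set[str] = field(default_factory=set)
    # Replication queues: queue id -> WAL names the queue still references.
    queues: dict[str, set[str]] = field(default_factory=dict)

    def referenced_wals(self) -> set[str]:
        """Return every WAL name that at least one queue still references."""
        refs: set[str] = set()
        for wals in self.queues.values():
            refs |= wals
        return refs

    def run_cleaner(self) -> set[str]:
        """Delete every old WAL that no queue references; return the deleted names."""
        deletable = self.old_wals - self.referenced_wals()
        self.old_wals -= deletable
        return deletable


# An orphaned queue entry keeps wal.1 and wal.2 pinned indefinitely.
cluster = ToyCluster(
    old_wals={"wal.1", "wal.2", "wal.3"},
    queues={"orphaned-queue": {"wal.1", "wal.2"}},
)
print(sorted(cluster.run_cleaner()))  # ['wal.3']

# Deleting the queue entry (the manual remediation) unlocks the rest.
del cluster.queues["orphaned-queue"]
print(sorted(cluster.run_cleaner()))  # ['wal.1', 'wal.2']
```

In this sketch, dropping the queue record entirely (the in-memory-only idea) would let the cleaner remove WALs that replication has not finished with, while an entry that is never cleared pins its WALs forever, which is the build-up described in this issue.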

> Region replica replication sometimes orphans WAL queue entries during recovery
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-27707
>                 URL: https://issues.apache.org/jira/browse/HBASE-27707
>             Project: HBase
>          Issue Type: Bug
>          Components: read replicas, Replication
>    Affects Versions: 2.5.0
>            Reporter: Nick Dimiduk
>            Priority: Critical
>
> Running with timeline-consistent read replicas and 
> {{hbase.region.replica.replication.enabled=true}}, we're seeing some region 
> servers with WAL queue entries that never clear. This appears to correlate 
> with SCP (ServerCrashProcedure) and recovery of replication queues. The 
> result is WALs that build up, consuming dangerous amounts of space on HDFS. 
> Remediation requires disabling and removing the 
> {{region_replica_replication}} peer, which forces an impacted region server 
> to abort with the message "Failed to operate on replication queue". We then 
> delete the ZK entry, which unlocks the WALs so the cleaner chore can sweep 
> them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
