[jira] [Created] (HBASE-27871) Meta replication stuck forever if wal it's still reading gets rolled and deleted

Wellington Chevreuil (Jira) Wed, 17 May 2023 07:45:11 -0700

Wellington Chevreuil created HBASE-27871:
--------------------------------------------


             Summary: Meta replication stuck forever if wal it's still reading 
gets rolled and deleted
                 Key: HBASE-27871
                 URL: https://issues.apache.org/jira/browse/HBASE-27871
             Project: HBase
          Issue Type: Bug
          Components: meta replicas
    Affects Versions: 2.5.4, 2.4.17
            Reporter: Wellington Chevreuil
            Assignee: Wellington Chevreuil


This affects branch-2 based releases only (in master, HBASE-26416 refactored 
region replication to not rely on the replication framework anymore).

Per the original [meta region replicas 
design|https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit],
 we use most of the replication framework for communicating changes in the 
primary replica back to the secondary ones, but we skip storing the queue state 
in ZK. In the event of a region replication crash, we should let the related 
replication source thread be interrupted, so that 
RegionReplicaReplicationEndpoint can set up a new source from scratch and 
make sure the secondary replicas are updated.
 
We ran into a situation in one of our customers' clusters where the region 
replica source lagged far behind (probably because the RSes hosting the 
secondary replicas were busy and slow to process the region replication 
entries), so the current WAL got rolled and eventually deleted while the 
replication source reader was still referring to it. In such cases, 
ReplicationSourceReader only sees the IOException and keeps retrying the read 
indefinitely, but since the file is now gone, it just gets stuck there 
forever. In the particular case of FNFE (FileNotFoundException, which I 
believe would only happen for region replication), we should just raise an 
exception and let RegionReplicaReplicationEndpoint handle it to reset the 
region replication source.
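
To illustrate the proposed behaviour, here is a minimal, self-contained sketch and not the actual HBase code. The class WalReadRetrySketch, the WalBatchReader interface and the readWithRetries method are hypothetical names made up for this example. The retry loop is also bounded here so that the sketch terminates, whereas the reader described above retries indefinitely. The idea is only that a generic IOException is retried, while a FileNotFoundException is rethrown right away so the caller can reset the source.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical sketch; none of these names exist in HBase.
final class WalReadRetrySketch {

    /** Stand-in for a single attempt at reading a batch of WAL entries. */
    interface WalBatchReader {
        int readBatch() throws IOException;
    }

    /**
     * Retries transient IOExceptions, but propagates FileNotFoundException
     * immediately: if the WAL file was rolled and deleted, retrying the read
     * can never succeed, so the caller should reset the source instead.
     */
    static int readWithRetries(WalBatchReader reader, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return reader.readBatch();
            } catch (FileNotFoundException fnfe) {
                // The file is gone for good: do not retry, let the caller handle it.
                throw fnfe;
            } catch (IOException ioe) {
                // Possibly transient: remember it, back off, and try again.
                last = ioe;
                Thread.sleep(sleepMillis);
            }
        }
        // All attempts failed with a retriable IOException.
        throw last;
    }
}
```

The FileNotFoundException catch clause has to come before the IOException one, since FileNotFoundException is a subclass of IOException.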
 
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
