Wellington Chevreuil created HBASE-27871:
--------------------------------------------
Summary: Meta replication stuck forever if wal it's still reading
gets rolled and deleted
Key: HBASE-27871
URL: https://issues.apache.org/jira/browse/HBASE-27871
Project: HBase
Issue Type: Bug
Components: meta replicas
Affects Versions: 2.5.4, 2.4.17
Reporter: Wellington Chevreuil
Assignee: Wellington Chevreuil
This affects branch-2 based releases only (in master, HBASE-26416 refactored
region replication to not rely on the replication framework anymore).
Per the original [meta region replicas
design|https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit],
we use most of the replication framework for communicating changes in the
primary replica back to the secondary ones, but we skip storing the queue state
in ZK. In the event of a region replication crash, we should let the related
replication source thread be interrupted, so that
RegionReplicaReplicationEndpoint would set a new source from scratch and
make sure to update the secondary replicas.
We have run into a situation in one of our customers' clusters where the region
replica source faced a long lag (probably because the RSes hosting the
secondary replicas were busy and slower in processing the region replication
entries), so that the current WAL got rolled and eventually deleted whilst the
replication source reader was still referring to it. In such cases,
ReplicationSourceReader only sees the IOException and keeps retrying the read
indefinitely, but since the file is now gone, it will just get stuck there
forever. In the particular case of a FileNotFoundException (FNFE), which I
believe would only happen for region replication, we should just raise an
exception and let RegionReplicaReplicationEndpoint handle it to reset the
region replication source.
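
As a rough illustration of the handling proposed above (a sketch, not the
actual patch), the read loop could treat a missing file differently from other
IOExceptions. Every name below (WalReadRetrySketch, WalRead,
SourceResetRequiredException, readWithRetry) is hypothetical and not part of
HBase, and the retry count is bounded here only so that the example terminates;
the real reader retries indefinitely.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.NoSuchFileException;

// Hypothetical sketch: none of these types exist in HBase.
final class WalReadRetrySketch {

    /** Hypothetical signal telling the caller to rebuild the replication source. */
    public static class SourceResetRequiredException extends RuntimeException {
        public SourceResetRequiredException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical stand-in for a single read attempt against a WAL file. */
    public interface WalRead {
        byte[] read() throws IOException;
    }

    /**
     * Retries transient IOExceptions, but stops retrying as soon as the file
     * is reported missing, since a deleted WAL will never become readable.
     */
    public static byte[] readWithRetry(WalRead read, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return read.read();
            } catch (FileNotFoundException | NoSuchFileException e) {
                // The WAL was rolled and deleted: retrying cannot succeed,
                // so escalate instead of looping.
                throw new SourceResetRequiredException("WAL no longer exists", e);
            } catch (IOException e) {
                // Any other I/O failure is assumed to be transient.
                last = e;
                Thread.sleep(sleepMillis);
            }
        }
        throw last;
    }
}
```

In this sketch the caller, playing the role of
RegionReplicaReplicationEndpoint, would catch SourceResetRequiredException and
create a new source rather than waiting on a file that no longer exists.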