[
https://issues.apache.org/jira/browse/HBASE-27871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wellington Chevreuil resolved HBASE-27871.
------------------------------------------
Resolution: Fixed
> Meta replication stuck forever if wal it's still reading gets rolled and
> deleted
> --------------------------------------------------------------------------------
>
> Key: HBASE-27871
> URL: https://issues.apache.org/jira/browse/HBASE-27871
> Project: HBase
> Issue Type: Bug
> Components: meta replicas
> Affects Versions: 2.6.0, 2.4.16, 2.4.17, 2.5.4
> Reporter: Wellington Chevreuil
> Assignee: Wellington Chevreuil
> Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6
>
>
> This affects branch-2 based releases only (in master, HBASE-26416 refactored
> region replication to not rely on the replication framework anymore).
> Per the original [meta region replicas
> design|https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit],
> we use most of the replication framework for communicating changes in the
> primary replica back to the secondary ones, but we skip storing the queue
> state in ZK. In the event of a region replication crash, we should let the
> related replication source thread be interrupted, so that
> RegionReplicaReplicationEndpoint sets up a new source from scratch and
> makes sure the secondary replicas get updated.
>
> We have run into a situation in one of our customers' clusters where the
> region replica source faced a long lag (probably because the RSes hosting the
> secondary replicas were busy and slow in processing the region replication
> entries), so that the current WAL got rolled and eventually deleted whilst
> the replication source reader was still referring to it. In such cases,
> ReplicationSourceReader only sees the IOException and keeps retrying the read
> indefinitely, but since the file is now gone, it will just get stuck there
> forever. In the particular case of FNFE (which I believe would only happen
> for region replication), we should just raise an exception and let
> RegionReplicaReplicationEndpoint handle it to reset the region replication
> source.
>
>
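
The handling proposed in the description can be sketched roughly as follows.
This is an illustration only, not the actual HBase patch: WalRead,
WalGoneException and readWithRetries are hypothetical names introduced here,
and the bounded maxRetries is an assumption made so that the example
terminates (the description says the real reader retries indefinitely). The
point is simply that a FileNotFoundException is treated as fatal and surfaced
to the caller, while any other IOException is still retried.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch; none of these types exist in HBase.
final class WalReaderRetrySketch {

  /** Hypothetical signal that the WAL file is gone and the source must be reset. */
  static class WalGoneException extends RuntimeException {
    WalGoneException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Hypothetical stand-in for reading one batch of entries from a WAL file. */
  interface WalRead {
    int readBatch() throws IOException;
  }

  /**
   * Retries transient IOExceptions, but gives up immediately when the file
   * no longer exists, since retrying can never succeed in that case.
   */
  static int readWithRetries(WalRead read, int maxRetries) {
    int attempts = 0;
    while (true) {
      try {
        return read.readBatch();
      } catch (FileNotFoundException e) {
        // The WAL was rolled and deleted: surface it so the caller can reset the source.
        throw new WalGoneException("WAL no longer exists, source must be reset", e);
      } catch (IOException e) {
        // Any other I/O failure is treated as transient and retried (bounded here).
        if (++attempts > maxRetries) {
          throw new UncheckedIOException(e);
        }
      }
    }
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Fails twice with a transient error, then succeeds.
    int n = readWithRetries(() -> {
      if (calls[0]++ < 2) {
        throw new IOException("transient");
      }
      return 5;
    }, 3);
    System.out.println("entries read after retries: " + n);

    try {
      readWithRetries(() -> {
        throw new FileNotFoundException("wal deleted");
      }, 3);
    } catch (WalGoneException e) {
      // In the scenario described, this is where the endpoint would set up a new source.
      System.out.println("resetting source: " + e.getMessage());
    }
  }
}
```

Catching FileNotFoundException before the general IOException matters here,
because it is a subclass of IOException and would otherwise fall into the
retry branch.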