Sid Khillon created HBASE-29987:
-----------------------------------
Summary: Replication position corruption when WAL file switch
detected in ReplicationSourceWALReader run loop
Key: HBASE-29987
URL: https://issues.apache.org/jira/browse/HBASE-29987
Project: HBase
Issue Type: Bug
Components: Replication, wal, Zookeeper
Reporter: Sid Khillon
When {{ReplicationSourceWALReader.run()}} detects a WAL file switch via the
{{switched()}} check at line 160, it enqueues an EOF batch but does not update
{{currentPosition}}. If the outer loop subsequently restarts (e.g., due to a
{{WALEntryFilterRetryableException}}), the new {{WALEntryStream}} is
created with the stale position from the old WAL file, which gets applied to
the new WAL file. This causes the reader to enter an infinite retry loop
attempting to seek to an invalid position, permanently stalling replication.
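The failure mode described above can be illustrated with a toy model. This is a minimal sketch and not HBase code: the class, the method names, and the offsets (8192 and 1024) are hypothetical, and the {{resetOnSwitch}} flag models one possible guard that this report does not itself propose.

```java
class StalePositionSketch {
    // Hypothetical numbers: offset reached in the old WAL, length of the new WAL.
    static final long OLD_WAL_OFFSET = 8192L;
    static final long NEW_WAL_LENGTH = 1024L;

    /**
     * Returns the position a restarted reader would use on the next WAL file.
     * resetOnSwitch=false models the reported behavior (position left untouched
     * on the file-switch path); true models an assumed guard that clears it.
     */
    static long positionAfterRestart(boolean resetOnSwitch) {
        long currentPosition = OLD_WAL_OFFSET; // last offset recorded in the old file
        // File switch detected: the old file is dequeued and an EOF batch is enqueued.
        if (resetOnSwitch) {
            currentPosition = 0L; // the old offset is meaningless for the new file
        }
        // Outer loop restarts: a new stream is opened at currentPosition.
        return currentPosition;
    }

    /** A seek target is only usable if it lies within the file. */
    static boolean seekIsValid(long position, long fileLength) {
        return position >= 0 && position <= fileLength;
    }

    public static void main(String[] args) {
        long stale = positionAfterRestart(false);
        long reset = positionAfterRestart(true);
        System.out.println("stale position " + stale + " valid in new WAL: "
            + seekIsValid(stale, NEW_WAL_LENGTH));
        System.out.println("reset position " + reset + " valid in new WAL: "
            + seekIsValid(reset, NEW_WAL_LENGTH));
    }
}
```

In this model the stale offset exceeds the new file's length, so every retry attempts the same invalid seek; nothing in the loop changes the position between attempts.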
The {{switched()}} path at line 160 fires when {{readWALEntries()}} returns a
batch without seeing EOF, either because batch capacity was reached or
because an error (e.g., a NameNode timeout) caused {{hasNext()}} inside
{{readWALEntries()}} to return RETRY, breaking the loop early. The next
{{hasNext()}} at line 153 then detects EOF, dequeues the old file, and returns
{{RETRY_IMMEDIATELY}}. The {{switched()}} check fires because
{{currentPath}} (captured before {{hasNext()}}) was the old file, but
the stream’s path is now null after the dequeue. {{currentPosition}} is not
updated on this path, so it still holds an offset from the old file.
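The sequence in the paragraph above can be traced step by step in a small standalone model. This is a sketch under stated assumptions: {{ToyStream}}, {{hasNextAtEof()}}, and the string-based path comparison are hypothetical stand-ins, not the actual {{WALEntryStream}} or {{switched()}} implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class SwitchCheckSketch {
    /** Hypothetical stand-in for the entry stream: a queue of WAL names plus the open path. */
    static final class ToyStream {
        final Deque<String> queue = new ArrayDeque<>();
        String currentPath;

        ToyStream(String... wals) {
            for (String wal : wals) {
                queue.add(wal);
            }
            currentPath = queue.peek();
        }

        /** Models hasNext() hitting EOF: drops the finished file and clears the open path. */
        String hasNextAtEof() {
            queue.poll();
            currentPath = null;
            return "RETRY_IMMEDIATELY";
        }
    }

    /** True when the stream no longer points at the path captured before hasNext(). */
    static boolean switched(ToyStream stream, String capturedPath) {
        return stream.currentPath == null || !capturedPath.equals(stream.currentPath);
    }

    public static void main(String[] args) {
        ToyStream stream = new ToyStream("wal.1", "wal.2");
        long currentPosition = 8192L;            // offset reached in wal.1 (made-up value)
        String captured = stream.currentPath;    // captured before hasNext()
        String state = stream.hasNextAtEof();    // EOF: wal.1 dequeued, path is now null
        boolean fileSwitched = switched(stream, captured);
        // Nothing above touched currentPosition, which is the gap the report describes.
        System.out.println(state + " switched=" + fileSwitched
            + " currentPosition=" + currentPosition + " next=" + stream.queue.peek());
    }
}
```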