sidkhillon opened a new pull request, #7909:
URL: https://github.com/apache/hbase/pull/7909

   When ReplicationSourceWALReader.run() detects a WAL file switch via the 
switched() check at line 160, it enqueues an EOF batch but does not update 
currentPosition. If the outer loop subsequently restarts (e.g., due to 
WALEntryFilterRetryableException), the new WALEntryStream is created with the 
stale position from the old WAL file, which gets applied to the new WAL file. 
This causes the reader to enter an infinite retry loop attempting to seek to an 
invalid position, permanently stalling replication.
   
   The switched() path at line 160 fires when readWALEntries() returns a batch 
without seeing EOF, either because batch capacity was reached or because an 
error (e.g., a NameNode timeout) caused hasNext() inside readWALEntries() to 
return RETRY, breaking the loop early. The next hasNext() at line 153 then 
detects EOF, dequeues the old file, and returns RETRY_IMMEDIATELY. The 
switched() check fires because currentPath (captured before hasNext()) was the 
old file, but the stream’s path is now null after the dequeue. Nothing on this 
path updates currentPosition, so it still holds the offset from the old file.
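
   The failure mode can be illustrated with a small standalone model. This is
only a sketch, not HBase code: the class StalePositionSketch, the Wal type, the
file lengths, and the resetOnSwitch flag are all hypothetical names and values
chosen for illustration. Resetting the position to 0 on a switch is an assumed
remedy used to contrast the two behaviours; the description above does not
state how the actual change fixes it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Objects;

// Simplified, hypothetical model of the stale-position problem (not HBase code).
class StalePositionSketch {

    // Stand-in for a WAL file: a name and a length in bytes.
    static final class Wal {
        final String name;
        final long length;

        Wal(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    // A reopened stream can only seek to an offset that lies inside the file.
    static boolean canSeek(Wal wal, long position) {
        return position >= 0 && position <= wal.length;
    }

    // Simulates: read part of the old WAL, detect the switch, restart the outer
    // loop, and report whether the new stream can seek to the carried position.
    static boolean restartSucceeds(boolean resetOnSwitch) {
        Deque<Wal> queue = new ArrayDeque<>();
        queue.add(new Wal("wal.1", 4096L)); // old file (assumed length)
        queue.add(new Wal("wal.2", 512L));  // new, shorter file (assumed length)

        Wal currentPath = queue.peek();     // captured before hasNext()
        long currentPosition = 4000L;       // offset reached in the old file

        queue.poll();                       // hasNext() sees EOF and dequeues the old file
        Wal streamPath = null;              // stream path is null after the dequeue

        boolean switched = !Objects.equals(currentPath, streamPath);
        if (switched && resetOnSwitch) {
            currentPosition = 0L;           // assumed remedy: start the new file at 0
        }

        // Outer loop restarts: a new stream opens the head of the queue at currentPosition.
        Wal next = queue.peek();
        return canSeek(next, currentPosition);
    }

    public static void main(String[] args) {
        System.out.println("stale position seeks OK: " + restartSucceeds(false));
        System.out.println("reset position seeks OK: " + restartSucceeds(true));
    }
}
```

   In this model the stale offset (4000) lies past the end of the new file
(512 bytes), so every restart fails the same way, which mirrors the retry loop
described above.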


