Sandeep Pal created HBASE-25596:
-----------------------------------

             Summary: Fix NPE in ReplicationSourceManager as well as avoid 
permanently unreplicated data due to EOFException from WAL
                 Key: HBASE-25596
                 URL: https://issues.apache.org/jira/browse/HBASE-25596
             Project: HBase
          Issue Type: Bug
            Reporter: Sandeep Pal
            Assignee: Sandeep Pal


There seems to be a major issue with how we handle the EOF exception from 
WALEntryStream. 

Problem:

When we see an EOFException, we handle it by removing the affected WAL from the 
log queue, but we never ship the batch of entries that was already read. *This 
is permanent data loss in replication.*

Secondly, we do not stop the reader on encountering the EOFException. If the 
EOFException occurred on the last WAL, we still try to process the 
WALEntryStream and ship an empty batch with lastWALPath set to null. This is 
the cause of the NPE below. 
{code:java}
2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] regionserver.ReplicationSource - Unexpected exception in ReplicationSourceWorkerThread, currentPath=null
java.lang.NullPointerException
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)
2021-02-16 15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - STOPPED: Unexpected exception in ReplicationSourceWorkerThread
{code}
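
To make the intended behaviour concrete, here is a minimal, self-contained sketch of the two fixes described above: ship the already-read batch before dropping the WAL that hit EOF, and never ship a batch whose last WAL path is null. It is not the HBase code or the actual patch. Every name in it (EofHandlingSketch, Wal, Batch, readInto, ship, replicate) is hypothetical, and the WAL is modelled as a plain list of strings.

```java
import java.io.EOFException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical model only; none of these types exist in HBase.
class EofHandlingSketch {

    /** Stand-in for a WAL file: some readable entries, optionally cut short by EOF. */
    static final class Wal {
        final String path;
        final List<String> entries;
        final boolean truncated;

        Wal(String path, List<String> entries, boolean truncated) {
            this.path = path;
            this.entries = entries;
            this.truncated = truncated;
        }
    }

    /** Stand-in for a batch of entries plus the path of the last WAL read. */
    static final class Batch {
        final List<String> entries = new ArrayList<>();
        String lastWalPath; // stays null if nothing was read
    }

    /** Reads all entries of a WAL into the batch; throws EOFException if the WAL is truncated. */
    static void readInto(Wal wal, Batch batch) throws EOFException {
        for (String entry : wal.entries) {
            batch.entries.add(entry);
            batch.lastWalPath = wal.path;
        }
        if (wal.truncated) {
            throw new EOFException("unexpected end of " + wal.path);
        }
    }

    /** Ships a batch, skipping empty batches so a null lastWalPath is never dereferenced. */
    static void ship(Batch batch, List<String> shipped) {
        if (batch.entries.isEmpty() || batch.lastWalPath == null) {
            return;
        }
        shipped.addAll(batch.entries);
    }

    /** Drains the log queue and returns everything that was shipped. */
    static List<String> replicate(Deque<Wal> logQueue) {
        List<String> shipped = new ArrayList<>();
        while (!logQueue.isEmpty()) {
            Batch batch = new Batch();
            try {
                readInto(logQueue.peek(), batch);
            } catch (EOFException e) {
                // Fix 1: ship what was read before the EOF instead of discarding it.
                ship(batch, shipped);
                logQueue.remove();
                if (logQueue.isEmpty()) {
                    break; // Fix 2: EOF on the last WAL, so stop the reader here.
                }
                continue;
            }
            ship(batch, shipped);
            logQueue.remove();
        }
        return shipped;
    }

    public static void main(String[] args) {
        Deque<Wal> queue = new ArrayDeque<>();
        queue.add(new Wal("wal-1", Arrays.asList("a", "b"), true));
        queue.add(new Wal("wal-2", Arrays.asList("c"), false));
        System.out.println(replicate(queue)); // prints [a, b, c]
    }
}
```

In this model, the entries read from wal-1 before the EOF are still shipped, and a truncated WAL with no readable entries produces no shipment at all rather than an empty batch with a null path.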

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
