[jira] [Commented] (HBASE-25596) Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated data due to EOFException from WAL

Duo Zhang (Jira) Thu, 10 Jun 2021 00:05:13 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-25596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360609#comment-17360609
 ]


Duo Zhang commented on HBASE-25596:
-----------------------------------

But the patch has also been appled to master and branch-2.x as well?

On master and branch-2.x, a replication batch can not cross wal files.

{code}
    if (!entryStream.hasNext()) {
      // check whether we have switched a file
      if (currentPath != null && switched(entryStream, currentPath)) {
        return WALEntryBatch.endOfFile(currentPath);
      } else {
        // This would mean either no more files in the queue
        // or there is no new data yet on the current wal
        return null;
      }
    }
    if (currentPath != null) {
      if (switched(entryStream, currentPath)) {
        return WALEntryBatch.endOfFile(currentPath);
      }
    } else {
      // when reading from the entry stream first time we will enter here
      currentPath = entryStream.getCurrentPath();
    }
{code}

See the above code.

So I think we could set the fix version to 1.7.0 and then revert the patches 
for master and all branch-2.x?

Thanks.

> Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated 
> data due to EOFException from WAL
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-25596
>                 URL: https://issues.apache.org/jira/browse/HBASE-25596
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sandeep Pal
>            Assignee: Sandeep Pal
>            Priority: Critical
>             Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.2
>
>
> There seems to be a major issue with how we handle the EOF exception from 
> WALEntryStream. 
> Problem:
> When we see EOFException, we try to handle it and remove it from the log 
> queue, but we never try to ship the existing batch of entries. *This is a 
> permanent data loss in replication.*
>  
> Secondly, we do not stop the reader on encountering the EOFException and thus 
> if EOFException was on the last WAL, we still try to process the WALEntry 
> stream and ship the empty batch with lastWALPath set to null. This is the 
> reason of NPE as below which *crash* the region server. 
> {code:java}
> 2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] 
> regionserver.ReplicationSource - Unexpected exception in 
> ReplicationSourceWorkerThread, 
> currentPath=nulljava.lang.NullPointerExceptionat 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)at
>  
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)2021-02-16
>  15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - 
> STOPPED: Unexpected exception in ReplicationSourceWorkerThread
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (HBASE-25596) Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated data due to EOFException from WAL

Reply via email to