[jira] [Commented] (HBASE-25596) Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated data due to EOFException from WAL

Duo Zhang (Jira) Thu, 01 Jul 2021 20:09:13 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-25596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373167#comment-17373167
 ]


Duo Zhang commented on HBASE-25596:
-----------------------------------

The patch for HBASE-25596 did fix the bug. The only remaining problem is that 
it did not follow some of the ideas we introduced when refactoring this part of 
the code for 2.x, which made the code a bit hard to read and understand and also 
introduced the problem described in HBASE-25985. So I opened HBASE-25992 to 
polish the code on 2.x, making it easier to read and understand and fixing 
the problem in HBASE-25985 as a side effect.

Thanks.

> Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated 
> data due to EOFException from WAL
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-25596
>                 URL: https://issues.apache.org/jira/browse/HBASE-25596
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Sandeep Pal
>            Assignee: Sandeep Pal
>            Priority: Critical
>             Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.2
>
>
> There seems to be a major issue with how we handle the EOFException from 
> WALEntryStream. 
> Problem:
> When we see an EOFException, we handle it by removing the WAL from the log 
> queue, but we never try to ship the existing batch of entries. *This is 
> permanent data loss in replication.*
>  
> Secondly, we do not stop the reader on encountering the EOFException, so 
> if the EOFException was on the last WAL, we still try to process the WALEntry 
> stream and ship an empty batch with lastWALPath set to null. This is the 
> cause of the NPE below, which *crashes* the region server. 
> {code:java}
> 2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] regionserver.ReplicationSource - Unexpected exception in ReplicationSourceWorkerThread, currentPath=null
> java.lang.NullPointerException
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)
> 2021-02-16 15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - STOPPED: Unexpected exception in ReplicationSourceWorkerThread
> {code}
>  
>  
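
For readers skimming the description above, the two failure modes can be illustrated with a small, self-contained sketch. This is not the HBase code or the actual patch: WalEofSketch, FakeWal, Batch, readInto and drain are hypothetical names made up for illustration, and the real reader and shipper are considerably more involved. The sketch only assumes the behaviour the description asks for: entries read before an EOFException are still shipped, and an empty batch (which would carry no last WAL path) is never shipped.

```java
import java.io.EOFException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-ins for illustration only; none of these are HBase classes.
final class WalEofSketch {

    /** A fake WAL: a path, its readable entries, and whether it ends abruptly. */
    static final class FakeWal {
        final String path;
        final List<String> entries;
        final boolean truncated;

        FakeWal(String path, List<String> entries, boolean truncated) {
            this.path = path;
            this.entries = entries;
            this.truncated = truncated;
        }
    }

    /** A batch handed to the shipper: entries plus the WAL they came from. */
    static final class Batch {
        final String lastWalPath;
        final List<String> entries;

        Batch(String lastWalPath, List<String> entries) {
            this.lastWalPath = lastWalPath;
            this.entries = entries;
        }
    }

    /** Reads all readable entries, then fails if the WAL is truncated. */
    static void readInto(FakeWal wal, List<String> pending) throws EOFException {
        pending.addAll(wal.entries);
        if (wal.truncated) {
            throw new EOFException("unexpected end of " + wal.path);
        }
    }

    /** Drains the queue and returns the batches that would be shipped. */
    static List<Batch> drain(Deque<FakeWal> queue) {
        List<Batch> shipped = new ArrayList<>();
        while (!queue.isEmpty()) {
            FakeWal wal = queue.poll();
            List<String> pending = new ArrayList<>();
            try {
                readInto(wal, pending);
            } catch (EOFException e) {
                // The WAL is already dequeued; fall through so that entries
                // read before the EOF are shipped rather than dropped.
            }
            // Never ship an empty batch: it would have no meaningful WAL path.
            if (!pending.isEmpty()) {
                shipped.add(new Batch(wal.path, new ArrayList<>(pending)));
            }
        }
        return shipped;
    }

    public static void main(String[] args) {
        Deque<FakeWal> queue = new ArrayDeque<>();
        queue.add(new FakeWal("wal-1", Arrays.asList("a", "b"), true));
        queue.add(new FakeWal("wal-2", new ArrayList<String>(), true));
        for (Batch b : drain(queue)) {
            System.out.println(b.lastWalPath + " -> " + b.entries);
        }
    }
}
```

In this toy run, wal-1 yields one batch with both entries despite the EOFException, and the empty, truncated wal-2 yields no batch at all, so the shipper never sees a null path.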



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
