[ 
https://issues.apache.org/jira/browse/HBASE-26849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511019#comment-17511019
 ] 

tianhang tang commented on HBASE-26849:
---------------------------------------

I tried using resetReader instead of createReader, making sure the cache is 
cleared in resetReader, but it failed:
{code:java}
2022-03-22 18:57:15,966 ERROR 
[RS_OPEN_REGION-hbase00C-test:16020-0.replicationSource.replicationWALReaderThread.hbase00c-test.di.io%2C16020%2C1647946596366.regiongroup-1,111]
 regionserver.ReplicationSourceWALReaderThread: Failed to read stream of 
replication entries:
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException:
 java.lang.IndexOutOfBoundsException: index (0) must be less than size (0)
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:108)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:128)
Caused by: java.lang.IndexOutOfBoundsException: index (0) must be less than 
size (0)
        at 
com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:305)
        at 
com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:284)
        at 
org.apache.hadoop.hbase.io.util.LRUDictionary$BidirectionalLRUMap.get(LRUDictionary.java:140)
        at 
org.apache.hadoop.hbase.io.util.LRUDictionary.getEntry(LRUDictionary.java:43)
        at 
org.apache.hadoop.hbase.regionserver.wal.WALCellCodec.uncompressByteString(WALCellCodec.java:177)
        at 
org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$1.uncompress(WALCellCodec.java:61)
        at org.apache.hadoop.hbase.wal.WALKey.readFieldsFromPb(WALKey.java:578)
        at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:377)
        at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104)
        at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:87)
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndSetPosition(WALEntryStream.java:281)
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:197)
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:106)
        ... 1 more
{code}

This might be because resetReader is called from some other places where the 
cache should not be cleared...
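The failure mode can be reproduced with a toy model of dictionary-based compression (a minimal sketch; the class and method names are hypothetical, not the real LRUDictionary API): the writer emits a value raw on its first occurrence and a dictionary index afterwards, so the reader stays in sync only if its dictionary state matches the position it is reading from. Clearing the dictionary without also rewinding to the matching position (or rewinding without clearing) produces exactly the IndexOutOfBoundsException above:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Toy model of dictionary-based WAL compression (hypothetical names,
// not the real LRUDictionary API). The writer emits a value raw the
// first time and a dictionary index afterwards; the reader can only
// resolve an index if it has replayed the matching raw occurrence.
public class DictDesyncDemo {
    static final int RAW = -1;

    // reader-side dictionary
    static List<String> dict = new ArrayList<>();

    static String read(int index, String rawValue) {
        if (index == RAW) {
            dict.add(rawValue);          // first occurrence: learn it
            return rawValue;
        }
        // later occurrence: resolve the back-reference
        if (index >= dict.size()) {
            throw new IndexOutOfBoundsException(
                "index (" + index + ") must be less than size (" + dict.size() + ")");
        }
        return dict.get(index);
    }

    public static void main(String[] args) {
        read(RAW, "region1");   // entry 1: learned raw, index 0 on the writer side
        dict.clear();           // cache cleared, but position not rewound
        try {
            read(0, null);      // entry 2 back-references index 0
        } catch (IndexOutOfBoundsException e) {
            System.out.println(e.getMessage());
            // prints: index (0) must be less than size (0)
        }
    }
}
{code}

The same mismatch occurs in the opposite direction when the position is rewound to 0 but the dictionary is kept, which is the original bug in the issue description below.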

Overall, I think the ROI of fixing this problem on the existing basis is 
relatively low. On my cluster, the existing patch covers this problem to a 
large extent as a temporary solution. We might as well open another issue and 
consider how to fundamentally refactor Dict. [~zhangduo] [~Xiaolin 
Ha] [~apurtell]
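One hypothetical shape for such a refactor (a rough sketch under my own assumptions; none of these names exist in HBase): let the reader-side dictionary snapshot its state at known-good WAL positions, so a seek can restore the dictionary to the state matching the target offset instead of choosing between keeping stale state and clearing it unconditionally:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical seek-friendly dictionary sketch (names invented for
// illustration): snapshots are keyed by WAL byte offset, and a seek
// restores the snapshot taken at or before the target offset.
public class SnapshottingDictionary {
    private List<String> entries = new ArrayList<>();
    // WAL byte offset -> copy of dictionary contents at that offset
    private final NavigableMap<Long, List<String>> snapshots = new TreeMap<>();

    public int add(String value) {
        entries.add(value);
        return entries.size() - 1;
    }

    public String get(int index) {
        return entries.get(index);
    }

    /** Record the state after fully parsing the entry ending at {@code position}. */
    public void snapshot(long position) {
        snapshots.put(position, new ArrayList<>(entries));
    }

    /** Rewind to the last snapshot at or before {@code position} (empty if none). */
    public void restoreTo(long position) {
        var floor = snapshots.floorEntry(position);
        entries = floor == null ? new ArrayList<>() : new ArrayList<>(floor.getValue());
    }
}
{code}

With something like this, resetReader could call restoreTo(position) and stay consistent with the write side at any rewind point, which is the property the current dictionary lacks.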

> NPE caused by WAL Compression and Replication
> ---------------------------------------------
>
>                 Key: HBASE-26849
>                 URL: https://issues.apache.org/jira/browse/HBASE-26849
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication, wal
>    Affects Versions: 1.7.1, 3.0.0-alpha-2, 2.4.11
>            Reporter: tianhang tang
>            Assignee: tianhang tang
>            Priority: Critical
>         Attachments: image-2022-03-16-14-25-49-276.png, 
> image-2022-03-16-14-30-15-247.png
>
>
> My cluster uses HBase 1.4.12, with WAL compression and replication enabled.
> I observed a replication sizeOfLogQueue backlog, and after some debugging, 
> found the NPE thrown by 
> [https://github.com/apache/hbase/blob/branch-1/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/LRUDictionary.java#L109:]
> !image-2022-03-16-14-25-49-276.png!
>  
> The root cause for this problem is:
> WALEntryStream#checkAllBytesParsed:
> !image-2022-03-16-14-30-15-247.png!
> resetReader does not create a new reader, so the original CompressionContext 
> and the dict inside it are retained.
> However, at this point the position is reset to 0, which means the HLog is 
> read again from the beginning while the cache that was never cleared is 
> still in use. This causes problems: data that is already in the LRUCache 
> gets added to the cache again.
> Recreating a new reader here solves the problem.
> I will open a PR later. However, there are some other places in the current 
> code that call resetReader or seekOnFs. I suspect this code does not take 
> the WAL compression case into account at all...
>  
> In theory, whenever the file is read again, the LRUCache should be rolled 
> back as well; otherwise the read and write paths behave inconsistently.
> But while the position can be rolled back to any intermediate point at 
> will, the LRUCache can't...



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
