[
https://issues.apache.org/jira/browse/HBASE-26849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511019#comment-17511019
]
tianhang tang commented on HBASE-26849:
---------------------------------------
I tried using resetReader instead of createReader, making sure the cache is
cleared in resetReader, but it failed:
{code:java}
2022-03-22 18:57:15,966 ERROR
[RS_OPEN_REGION-hbase00C-test:16020-0.replicationSource.replicationWALReaderThread.hbase00c-test.di.io%2C16020%2C1647946596366.regiongroup-1,111]
regionserver.ReplicationSourceWALReaderThread: Failed to read stream of
replication entries:
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream$WALEntryStreamRuntimeException:
java.lang.IndexOutOfBoundsException: index (0) must be less than size (0)
at
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:108)
at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReaderThread.run(ReplicationSourceWALReaderThread.java:128)
Caused by: java.lang.IndexOutOfBoundsException: index (0) must be less than
size (0)
at
com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:305)
at
com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:284)
at
org.apache.hadoop.hbase.io.util.LRUDictionary$BidirectionalLRUMap.get(LRUDictionary.java:140)
at
org.apache.hadoop.hbase.io.util.LRUDictionary.getEntry(LRUDictionary.java:43)
at
org.apache.hadoop.hbase.regionserver.wal.WALCellCodec.uncompressByteString(WALCellCodec.java:177)
at
org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$1.uncompress(WALCellCodec.java:61)
at org.apache.hadoop.hbase.wal.WALKey.readFieldsFromPb(WALKey.java:578)
at
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:377)
at
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104)
at
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:87)
at
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndSetPosition(WALEntryStream.java:281)
at
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:197)
at
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:106)
... 1 more
{code}
This might be because resetReader is also called from other places, where the
cache should not be cleared...
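
To make the two failure modes concrete, here is a minimal toy model. It is not
HBase code: DictRewindSketch, ToyReader, Record, seek and seekAndClear are
hypothetical names, and a plain list stands in for the LRU dictionary. It only
assumes that a reader adds an entry when it sees a literal and looks up an
index otherwise. Rewinding to 0 while keeping the dictionary adds the same
value twice; clearing the dictionary but resuming from a later position makes
the index lookup fail, which is the same kind of failure as in the trace above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of reading a dictionary-compressed log; names are hypothetical. */
public class DictRewindSketch {

  /** A log record: a literal (first occurrence) or a reference by dictionary index. */
  static final class Record {
    final String literal; // non-null when the writer emitted the value itself
    final int index;      // used only when literal is null
    Record(String literal, int index) {
      this.literal = literal;
      this.index = index;
    }
  }

  /** Reader state: a position in the log plus the dictionary built while reading. */
  static final class ToyReader {
    private final List<Record> log;
    private final List<String> dict = new ArrayList<>();
    private int pos = 0;

    ToyReader(List<Record> log) {
      this.log = log;
    }

    String next() {
      Record r = log.get(pos++);
      if (r.literal != null) {
        dict.add(r.literal); // the reader mirrors the writer by adding the literal
        return r.literal;
      }
      // Throws IndexOutOfBoundsException if the dictionary lacks this entry.
      return dict.get(r.index);
    }

    /** Moves the position and keeps the dictionary as it is. */
    void seek(int newPos) {
      pos = newPos;
    }

    /** Moves the position and clears the dictionary. */
    void seekAndClear(int newPos) {
      pos = newPos;
      dict.clear();
    }

    int dictSize() {
      return dict.size();
    }
  }

  static List<Record> sampleLog() {
    List<Record> log = new ArrayList<>();
    log.add(new Record("table-a", -1)); // literal, becomes dictionary index 0
    log.add(new Record(null, 0));       // reference to index 0
    log.add(new Record(null, 0));       // reference to index 0
    return log;
  }

  public static void main(String[] args) {
    // Case 1: rewind to 0 without clearing. The writer's dictionary has one
    // entry, but the reader now holds the same value twice.
    ToyReader keep = new ToyReader(sampleLog());
    keep.next();
    keep.next();
    keep.seek(0);
    keep.next();
    System.out.println("dictionary size after rewind: " + keep.dictSize());

    // Case 2: clear the dictionary but resume from an intermediate position.
    ToyReader cleared = new ToyReader(sampleLog());
    cleared.next();
    cleared.seekAndClear(1);
    try {
      cleared.next();
    } catch (IndexOutOfBoundsException e) {
      System.out.println("lookup failed: " + e.getMessage());
    }
  }
}
```

In this toy, clearing is only safe when the position goes back to 0, and
keeping the dictionary is only safe when the position does not move backwards.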
Overall, I think the ROI of fixing this problem within the existing design is
relatively low. On my cluster, the existing patch covers this problem to a
large extent as a temporary solution. We might as well open another issue and
consider how to fundamentally refactor Dict. [~zhangduo] [~Xiaolin Ha]
[~apurtell]
> NPE caused by WAL Compression and Replication
> ---------------------------------------------
>
> Key: HBASE-26849
> URL: https://issues.apache.org/jira/browse/HBASE-26849
> Project: HBase
> Issue Type: Bug
> Components: Replication, wal
> Affects Versions: 1.7.1, 3.0.0-alpha-2, 2.4.11
> Reporter: tianhang tang
> Assignee: tianhang tang
> Priority: Critical
> Attachments: image-2022-03-16-14-25-49-276.png,
> image-2022-03-16-14-30-15-247.png
>
>
> My cluster uses HBase 1.4.12, with WAL compression and replication enabled.
> I found that the replication sizeOfLogQueue was backing up, and after some
> debugging found an NPE thrown by
> [https://github.com/apache/hbase/blob/branch-1/hbase-common/src/main/java/org/apache/hadoop/hbase/io/util/LRUDictionary.java#L109:]
> !image-2022-03-16-14-25-49-276.png!
>
> The root cause for this problem is:
> WALEntryStream#checkAllBytesParsed:
> !image-2022-03-16-14-30-15-247.png!
> resetReader does not create a new reader, so the original CompressionContext
> and the dict in it are retained.
> However, at this point the position is reset to 0, which means that the HLog
> is read from the beginning while the cache that has not been cleared is still
> in use. This causes problems: the same data is already in the LRUCache, and
> it is added to the cache again.
> Recreating the reader here solves the problem.
> I will open a PR later. However, there are some other places in the current
> code that call resetReader or seekOnFs. I guess that code doesn't take the
> WAL compression case into account at all...
>
> In theory, whenever the file is read again, the LRUCache should also be
> rolled back; otherwise the READ and WRITE paths will behave inconsistently.
> However, the position can be rolled back to any intermediate position at
> will, while the LRUCache can't...