[
https://issues.apache.org/jira/browse/HBASE-22539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Duo Zhang updated HBASE-22539:
------------------------------
Release Note:
We found a critical bug which can lead to WAL corruption when
Durability.ASYNC_WAL is used. The reason is that we release a ByteBuffer before
actually persisting its content into the WAL file.
The problem may lead to several errors, for example, an
ArrayIndexOutOfBoundsException when replaying the WAL:
ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
java.lang.ArrayIndexOutOfBoundsException: 18056
at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1365)
at org.apache.hadoop.hbase.KeyValue.getFamilyLength(KeyValue.java:1358)
at org.apache.hadoop.hbase.PrivateCellUtil.matchingFamily(PrivateCellUtil.java:735)
at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:816)
at org.apache.hadoop.hbase.wal.WALEdit.isMetaEditFamily(WALEdit.java:143)
at org.apache.hadoop.hbase.wal.WALEdit.isMetaEdit(WALEdit.java:148)
at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:297)
at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:195)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:100)
It may even cause a segmentation fault and crash the JVM directly. In that
case you will see a hs_err_pidXXX.log file, and usually the signal is SIGSEGV.
The problem has been reported several times in the past. This time
Wellington Ramos Chevreuil provided the full logs and analyzed them in depth,
so we could find the root cause, and Lijin Bin figured out that the problem may
only happen when Durability.ASYNC_WAL is used. Thanks to them.
The problem only affects the 2.x releases. All users are highly recommended to
upgrade to a release which includes this fix, especially if you use
Durability.ASYNC_WAL.
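
The early-release hazard described above can be illustrated with a minimal,
single-threaded sketch in plain Java. It is not HBase code: the class name,
the POOLED buffer, and the pending queue are hypothetical stand-ins for a
reused direct buffer and for edits that are queued but not yet persisted. The
copy in the second method is only one way to avoid the problem and is not
necessarily how the actual fix works.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Queue;

class EarlyReleaseSketch {
    // Hypothetical stand-in for a pooled direct buffer that gets reused across requests.
    static final ByteBuffer POOLED = ByteBuffer.allocateDirect(16);

    // Simulates a request writing its payload into the pooled buffer.
    static void fill(String s) {
        POOLED.clear();
        POOLED.put(s.getBytes(StandardCharsets.US_ASCII));
        POOLED.flip();
    }

    // Simulates the deferred write: reads whatever the queued buffer holds now.
    static String drain(ByteBuffer b) {
        byte[] out = new byte[b.remaining()];
        b.duplicate().get(out);
        return new String(out, StandardCharsets.US_ASCII);
    }

    // Unsafe ordering: the queued entry still points at the pooled memory
    // when the buffer is released and reused by the next request.
    static String writeWithEarlyRelease() {
        Queue<ByteBuffer> pending = new ArrayDeque<>();
        fill("edit-1");
        pending.add(POOLED.duplicate()); // shares memory with POOLED
        fill("EDIT-2");                  // buffer reused before the deferred write
        return drain(pending.remove());
    }

    // One possible safe ordering: copy the bytes before the buffer is reused.
    static String writeWithCopy() {
        Queue<ByteBuffer> pending = new ArrayDeque<>();
        fill("edit-1");
        ByteBuffer copy = ByteBuffer.allocate(POOLED.remaining());
        copy.put(POOLED.duplicate());
        copy.flip();
        pending.add(copy);
        fill("EDIT-2");
        return drain(pending.remove());
    }

    public static void main(String[] args) {
        System.out.println(writeWithEarlyRelease()); // content of the later request
        System.out.println(writeWithCopy());         // content of the original edit
    }
}
```

In the first method the deferred write persists the bytes of the later
request instead of the original edit. With real cells, the overwritten bytes
can include length fields, which is consistent with the
ArrayIndexOutOfBoundsException shown in the stack trace above.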
> WAL corruption due to early DBBs re-use when Durability.ASYNC_WAL is used
> -------------------------------------------------------------------------
>
> Key: HBASE-22539
> URL: https://issues.apache.org/jira/browse/HBASE-22539
> Project: HBase
> Issue Type: Bug
> Components: rpc, wal
> Affects Versions: 2.2.0, 2.0.5, 2.1.5
> Reporter: Wellington Chevreuil
> Assignee: Duo Zhang
> Priority: Blocker
> Fix For: 3.0.0, 2.3.0, 2.0.6, 2.2.1, 2.1.6
>
> Attachments: HBASE-22539-UT.patch, HBASE-22539.branch-2.001.patch
>
>
> Summary
> We had been chasing a WAL corruption issue reported on one of our customers'
> deployments running release 2.1.1 (CDH 6.1.0). After providing a custom
> modified jar with the extra sanity checks implemented by HBASE-21401 applied
> on some code points, plus additional debugging messages, we believe it is
> related to DirectByteBuffer usage, and Unsafe copy from offheap memory to
> on-heap array triggered
> [here|https://github.com/apache/hbase/blob/branch-2.1/hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java#L1157],
> such as when writing into a non ByteBufferWriter type, as done
> [here|https://github.com/apache/hbase/blob/branch-2.1/hbase-common/src/main/java/org/apache/hadoop/hbase/io/ByteBufferWriterOutputStream.java#L84].
> More details in the following comment.
>