[
https://issues.apache.org/jira/browse/HBASE-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Shelukhin updated HBASE-21817:
-------------------------------------
Description:
See HBASE-21601 for context.
I looked at the code a bit, but it will take a while to understand, so for now
I'm going to mitigate the problem by skipping such records. Given that the
record is bogus and its lengths are intact, skipping is safe in this scenario.
However, it's conceivable that a bug could make skipping such a record cause
data loss. Regardless, failing to split the WAL would lose even more data in
this case, so it should be acceptable to handle errors where the record
structure is intact but the cells themselves are corrupted.
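Roughly what I have in mind, as a minimal sketch rather than the actual patch
(the class and helper names here are illustrative only):
{code}
import java.util.List;

import org.apache.hadoop.hbase.Cell;

/**
 * Minimal sketch of the proposed mitigation, not the actual patch: check the
 * cell length fields before the output sink clones anything, and skip a
 * record whose cells are corrupted instead of failing the whole WAL split.
 */
final class CorruptCellSkipSketch {

  /** True if any length field is negative, i.e. the cell is corrupted. */
  static boolean isCorrupted(Cell cell) {
    // A corrupted record can carry negative lengths here, which later blow
    // up in CellUtil.cloneFamily as a NegativeArraySizeException.
    return cell.getRowLength() < 0
        || cell.getFamilyLength() < 0
        || cell.getQualifierLength() < 0
        || cell.getValueLength() < 0;
  }

  /** Returns true if the record's cells are all sane and it can be written. */
  static boolean shouldWriteRecord(List<Cell> cells) {
    for (Cell cell : cells) {
      if (isCorrupted(cell)) {
        return false; // skip the whole record so the split itself succeeds
      }
    }
    return true;
  }
}
{code}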
was:
{noformat}
2018-12-13 17:01:12,208 ERROR [RS_LOG_REPLAY_OPS-regionserver/...] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
java.lang.RuntimeException: java.lang.NegativeArraySizeException
    at org.apache.hadoop.hbase.wal.WALSplitter$PipelineController.checkForErrors(WALSplitter.java:846)
    at org.apache.hadoop.hbase.wal.WALSplitter$OutputSink.finishWriting(WALSplitter.java:1203)
    at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.finishWritingAndClose(WALSplitter.java:1267)
    at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:349)
    at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:196)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.splitLog(SplitLogWorker.java:178)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.lambda$new$0(SplitLogWorker.java:90)
    at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:70)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NegativeArraySizeException
    at org.apache.hadoop.hbase.CellUtil.cloneFamily(CellUtil.java:113)
    at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.filterCellByStore(WALSplitter.java:1542)
    at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.appendBuffer(WALSplitter.java:1586)
    at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.append(WALSplitter.java:1560)
    at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1085)
    at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1077)
    at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1047)
{noformat}
Unfortunately I cannot share the file.
The issue appears to be straightforward: for whatever reason, the cell's
family length is negative. I'm not sure how such a cell got created; I suspect
the file was corrupted. The allocation that fails (in CellUtil.cloneFamily,
line 113) is:
{code}
byte[] output = new byte[cell.getFamilyLength()];
{code}
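For reference, Cell.getFamilyLength() returns a byte, so a corrupted length
byte with the high bit set is sign-extended to a negative int when used as an
array size. A standalone demonstration of the failure mode:
{code}
public class NegativeArraySizeDemo {
  public static void main(String[] args) {
    // A corrupted family-length byte with the high bit set...
    byte familyLength = (byte) 0xF0; // sign-extends to -16 as an int
    // ...produces a negative array size, reproducing the crash seen in
    // CellUtil.cloneFamily(CellUtil.java:113).
    byte[] output = new byte[familyLength]; // throws NegativeArraySizeException
  }
}
{code}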
> skip records with corrupted cells in WAL splitting
> --------------------------------------------------
>
> Key: HBASE-21817
> URL: https://issues.apache.org/jira/browse/HBASE-21817
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Priority: Critical
>
> See HBASE-21601 for context.
> I looked at the code a bit, but it will take a while to understand, so for
> now I'm going to mitigate the problem by skipping such records. Given that
> the record is bogus and its lengths are intact, skipping is safe in this
> scenario.
> However, it's conceivable that a bug could make skipping such a record cause
> data loss. Regardless, failing to split the WAL would lose even more data in
> this case, so it should be acceptable to handle errors where the record
> structure is intact but the cells themselves are corrupted.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)