[
https://issues.apache.org/jira/browse/HBASE-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Shelukhin updated HBASE-21817:
-------------------------------------
Description:
See HBASE-21601 for context.
I looked at the code a bit, but it will take a while to understand, so for now
I'm going to mitigate the problem by skipping such records. Given that the
record is bogus and its lengths are intact, skipping is safe in this scenario.
However, it's conceivable that a bug could make skipping such a record cause
data loss. Regardless, failing to split the WAL would lose even more data in
this case, so it should be acceptable to handle errors where the record
structure is intact but the cells themselves are corrupted.
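Roughly what I have in mind, as a minimal sketch rather than the actual patch
(the class and helper names here are illustrative only):
{code}
import java.util.List;

import org.apache.hadoop.hbase.Cell;

/**
 * Minimal sketch of the proposed mitigation, not the actual patch: check the
 * cell length fields before the output sink clones anything, and skip a
 * record whose cells are corrupted instead of failing the whole WAL split.
 */
final class CorruptCellSkipSketch {

  /** True if any length field is negative, i.e. the cell is corrupted. */
  static boolean isCorrupted(Cell cell) {
    // A corrupted record can carry negative lengths here, which later blow
    // up in CellUtil.cloneFamily as a NegativeArraySizeException.
    return cell.getRowLength() < 0
        || cell.getFamilyLength() < 0
        || cell.getQualifierLength() < 0
        || cell.getValueLength() < 0;
  }

  /** Returns true if the record's cells are all sane and it can be written. */
  static boolean shouldWriteRecord(List<Cell> cells) {
    for (Cell cell : cells) {
      if (isCorrupted(cell)) {
        return false; // skip the whole record so the split itself succeeds
      }
    }
    return true;
  }
}
{code}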
was:
{noformat}
2018-12-13 17:01:12,208 ERROR [RS_LOG_REPLAY_OPS-regionserver/...] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
java.lang.RuntimeException: java.lang.NegativeArraySizeException
    at org.apache.hadoop.hbase.wal.WALSplitter$PipelineController.checkForErrors(WALSplitter.java:846)
    at org.apache.hadoop.hbase.wal.WALSplitter$OutputSink.finishWriting(WALSplitter.java:1203)
    at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.finishWritingAndClose(WALSplitter.java:1267)
    at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:349)
    at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:196)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.splitLog(SplitLogWorker.java:178)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.lambda$new$0(SplitLogWorker.java:90)
    at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:70)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NegativeArraySizeException
    at org.apache.hadoop.hbase.CellUtil.cloneFamily(CellUtil.java:113)
    at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.filterCellByStore(WALSplitter.java:1542)
    at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.appendBuffer(WALSplitter.java:1586)
    at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.append(WALSplitter.java:1560)
    at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1085)
    at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1077)
    at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1047)
{noformat}
Unfortunately I cannot share the file.
The issue appears to be straightforward: for whatever reason, the cell's
family length is negative. I'm not sure how such a cell got created; I suspect
the file was corrupted. The allocation that fails (in CellUtil.cloneFamily,
line 113) is:
{code}
byte[] output = new byte[cell.getFamilyLength()];
{code}
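For reference, Cell.getFamilyLength() returns a byte, so a corrupted length
byte with the high bit set is sign-extended to a negative int when used as an
array size. A standalone demonstration of the failure mode:
{code}
public class NegativeArraySizeDemo {
  public static void main(String[] args) {
    // A corrupted family-length byte with the high bit set...
    byte familyLength = (byte) 0xF0; // sign-extends to -16 as an int
    // ...produces a negative array size, reproducing the crash seen in
    // CellUtil.cloneFamily(CellUtil.java:113).
    byte[] output = new byte[familyLength]; // throws NegativeArraySizeException
  }
}
{code}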
> skip records with corrupted cells in WAL splitting
> --------------------------------------------------
>
> Key: HBASE-21817
> URL: https://issues.apache.org/jira/browse/HBASE-21817
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Priority: Critical
>
> See HBASE-21601 for context.
> I looked at the code a bit, but it will take a while to understand, so for
> now I'm going to mitigate the problem by skipping such records. Given that
> the record is bogus and its lengths are intact, skipping is safe in this
> scenario.
> However, it's conceivable that a bug could make skipping such a record cause
> data loss. Regardless, failing to split the WAL would lose even more data in
> this case, so it should be acceptable to handle errors where the record
> structure is intact but the cells themselves are corrupted.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)