[
https://issues.apache.org/jira/browse/HBASE-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758582#comment-16758582
]
Sergey Shelukhin edited comment on HBASE-21817 at 2/1/19 7:01 PM:
------------------------------------------------------------------
Currently, failure to split the log in this case results in RSes crashing with
various errors, generally about array offsets, and regions being offline.
I think with skipErrors it's OK to just skip the record like this patch does;
I can add the check.
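
A minimal sketch of the skipErrors behavior described above, not the patch
itself. SkipCorruptSketch, Entry, and filter are hypothetical names, and
corruption is modeled as a precomputed flag rather than detected from real
WAL cells.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; these are not HBase classes.
final class SkipCorruptSketch {
    /** A WAL record reduced to a region name and a corruption flag. */
    static final class Entry {
        final String region;
        final boolean corrupt;

        Entry(String region, boolean corrupt) {
            this.region = region;
            this.corrupt = corrupt;
        }
    }

    /**
     * Returns the entries that are safe to replay. A corrupted entry is
     * dropped when skipErrors is true; otherwise the split fails fast.
     */
    static List<Entry> filter(List<Entry> entries, boolean skipErrors) {
        List<Entry> kept = new ArrayList<>();
        for (Entry e : entries) {
            if (!e.corrupt) {
                kept.add(e);
            } else if (!skipErrors) {
                throw new IllegalStateException(
                    "corrupted cell in record for region " + e.region);
            }
            // Otherwise skipErrors is set and the corrupted record is skipped.
        }
        return kept;
    }
}
```

Failing fast when skipErrors is off keeps the default conservative, so records
are only dropped when the operator has opted in.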
I wonder if a better way would be to write the corrupted records to a separate
WAL, and only keep the regions that have corrupted records offline, not all the
regions in the WAL. That would be a bigger change, though, to handle it
gracefully on the RS as well as the master. Then the admin can keep or delete
the file when they notice the region is offline.
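
The separate-WAL idea could look roughly like the sketch below. It is only an
illustration under assumptions: QuarantineSketch, Entry, and Result are
hypothetical names, an in-memory list stands in for the separate WAL file, and
the region of a corrupted record is assumed to still be readable.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for illustration; these are not HBase classes.
final class QuarantineSketch {
    static final class Entry {
        final String region;
        final boolean corrupt;

        Entry(String region, boolean corrupt) {
            this.region = region;
            this.corrupt = corrupt;
        }
    }

    static final class Result {
        /** Healthy edits grouped by region, ready to be replayed. */
        final Map<String, List<Entry>> recovered = new LinkedHashMap<>();
        /** Stand-in for the separate WAL that holds corrupted records. */
        final List<Entry> quarantined = new ArrayList<>();
        /** Only these regions stay offline until an admin decides. */
        final Set<String> regionsToKeepOffline = new LinkedHashSet<>();
    }

    static Result split(List<Entry> entries) {
        Result result = new Result();
        for (Entry e : entries) {
            if (e.corrupt) {
                result.quarantined.add(e);
                result.regionsToKeepOffline.add(e.region);
            } else {
                result.recovered
                    .computeIfAbsent(e.region, k -> new ArrayList<>())
                    .add(e);
            }
        }
        return result;
    }
}
```

Regions that never appear in regionsToKeepOffline can be reopened right away,
which is the main difference from failing the whole split.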
I can remove the main method; it's not intended for recovery, just for
debugging, so we don't really need it.
> skip records with corrupted cells in WAL splitting
> --------------------------------------------------
>
> Key: HBASE-21817
> URL: https://issues.apache.org/jira/browse/HBASE-21817
> Project: HBase
> Issue Type: Bug
> Components: wal
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Priority: Critical
> Attachments: HBASE-21817.patch
>
>
> See HBASE-21601 for context.
> I looked at the code a bit but it will take a while to understand, so for now
> I'm going to mitigate it by skipping such records. Given that this record is
> bogus, and the lengths are intact, for this scenario it's safe to do so.
> However, I guess it's possible to have a bug where skipping such a record
> would lead to data loss. Regardless, failure to split the WAL will lead to
> even more data loss in this case, so it should be OK to handle errors where
> the structure is correct but cells are corrupted.
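
To illustrate why intact lengths make skipping safe, here is a small sketch
over a made-up length-prefixed layout. The layout ([int recordLen][short
keyLen][key bytes][value bytes]) and the names LengthPrefixedSkipSketch and
readKeys are assumptions for the example, not the real WAL format or HBase
code.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical layout: [int recordLen][short keyLen][key bytes][value bytes].
final class LengthPrefixedSkipSketch {
    /** Collects keys from well-formed records and skips corrupted ones. */
    static List<byte[]> readKeys(ByteBuffer buf) {
        List<byte[]> keys = new ArrayList<>();
        while (buf.remaining() >= Integer.BYTES) {
            int recordLen = buf.getInt();
            if (recordLen < Short.BYTES || recordLen > buf.remaining()) {
                // The framing itself is broken, so later records cannot be located.
                throw new IllegalStateException("bad record length " + recordLen);
            }
            int next = buf.position() + recordLen;
            int keyLen = buf.getShort();
            if (keyLen >= 0 && keyLen <= recordLen - Short.BYTES) {
                byte[] key = new byte[keyLen];
                buf.get(key);
                keys.add(key);
            }
            // Corrupted or not, jump to the next record using the intact outer length.
            buf.position(next);
        }
        return keys;
    }
}
```

A bogus inner length would otherwise surface as an array-offset style error;
checking it against the outer length turns that into a skipped record.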