[ 
https://issues.apache.org/jira/browse/HBASE-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758582#comment-16758582
 ] 

Sergey Shelukhin edited comment on HBASE-21817 at 2/1/19 7:01 PM:
------------------------------------------------------------------

Currently, a failure to split the log in this case results in RSes crashing with 
various errors, generally about array offsets, and in regions being offline.
I think that with skipErrors it's OK to just skip the record as this patch does; 
I can add the check.
I wonder if a better way would be to write the corrupted records to a separate 
WAL, and keep offline only the regions that have corrupted records, not all the 
regions in the WAL. That would be a bigger change, though, to handle it 
gracefully on the RS as well as on the master. Then the admin can keep or delete 
the file when they notice the region is offline.
I can remove the main method; it's not intended for recovery, just for debugging, 
so we don't really need it.
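
To make the three options above easier to compare (fail the split, skip the 
record, or set it aside and keep only the affected region offline), here is a 
rough sketch. Everything in it is hypothetical and for illustration only: 
Policy, Entry, Result and split() are made-up names, not HBase classes, and 
the "corrupted" flag stands in for whatever check the splitter would really do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; none of these types exist in HBase.
final class QuarantineSplitSketch {
    /** What to do when a record with corrupted cells is found. */
    enum Policy { FAIL, SKIP, QUARANTINE }

    /** A WAL record reduced to the two facts this sketch needs. */
    static final class Entry {
        final String region;
        final boolean corrupted;
        Entry(String region, boolean corrupted) {
            this.region = region;
            this.corrupted = corrupted;
        }
    }

    /** Outcome of one pass over the log. */
    static final class Result {
        final List<Entry> recovered = new ArrayList<>();
        final List<Entry> quarantined = new ArrayList<>();
        final Set<String> offlineRegions = new TreeSet<>();
    }

    static Result split(List<Entry> wal, Policy policy) {
        Result result = new Result();
        for (Entry e : wal) {
            if (!e.corrupted) {
                result.recovered.add(e);
                continue;
            }
            switch (policy) {
                case FAIL:
                    // Roughly today's outcome: the whole split fails.
                    throw new IllegalStateException("corrupted record for " + e.region);
                case SKIP:
                    // Drop the record and carry on; no region stays offline.
                    break;
                case QUARANTINE:
                    // Set the record aside and mark only its region offline.
                    result.quarantined.add(e);
                    result.offlineRegions.add(e.region);
                    break;
            }
        }
        return result;
    }
}
```

With QUARANTINE, only the region that owns a bad record ends up in 
offlineRegions, and the quarantined list plays the role of the separate file 
the admin would later keep or delete.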


> skip records with corrupted cells in WAL splitting
> --------------------------------------------------
>
>                 Key: HBASE-21817
>                 URL: https://issues.apache.org/jira/browse/HBASE-21817
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Critical
>         Attachments: HBASE-21817.patch
>
>
> See HBASE-21601 for context.
> I looked at the code a bit, but it will take a while to understand, so for now 
> I'm going to mitigate it by skipping such records. Given that this record is 
> bogus and the lengths are intact, it's safe to do so in this scenario. 
> However, I guess it's possible to have a bug where skipping such a record would 
> lead to data loss. Regardless, failure to split the WAL would lead to even 
> more data loss in this case, so it should be OK to handle errors where the 
> structure is correct but the cells are corrupted.
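
As a rough illustration of why intact lengths make skipping safe: if each 
record is length-prefixed, the reader can step over a record whose cells are 
bad and still land exactly on the next record. The sketch below assumes a 
made-up framing (a 4-byte record length followed by 4-byte-length-prefixed 
cells); this is NOT the actual HBase WAL format, and the class and method 
names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical framing for illustration; this is NOT the HBase WAL format.
//   Log:    [int recordLen][record bytes] [int recordLen][record bytes] ...
//   Record: [int cellLen][cell bytes] [int cellLen][cell bytes] ...
final class SkipCorruptCellsSketch {
    /** True if every cell length is non-negative and fits, and the cells fill the record exactly. */
    static boolean cellsAreValid(ByteBuffer record) {
        ByteBuffer b = record.duplicate();
        while (b.remaining() >= Integer.BYTES) {
            int cellLen = b.getInt();
            if (cellLen < 0 || cellLen > b.remaining()) {
                return false; // would otherwise surface as an array offset error
            }
            b.position(b.position() + cellLen);
        }
        return b.remaining() == 0;
    }

    /** Returns the payloads of well-formed records, skipping records whose cells are corrupted. */
    static List<byte[]> readSkippingCorrupt(byte[] log) {
        List<byte[]> good = new ArrayList<>();
        ByteBuffer in = ByteBuffer.wrap(log);
        while (in.remaining() >= Integer.BYTES) {
            int recordLen = in.getInt();
            if (recordLen < 0 || recordLen > in.remaining()) {
                // The outer structure itself is broken, so skipping is no longer safe.
                throw new IllegalStateException("record framing is broken");
            }
            byte[] payload = new byte[recordLen];
            in.get(payload); // always advances to the next record boundary
            if (cellsAreValid(ByteBuffer.wrap(payload))) {
                good.add(payload);
            }
        }
        return good;
    }
}
```

The sketch only skips when the outer length is trustworthy. Once the record 
framing itself is damaged, it gives up rather than guess where the next record 
starts, which matches the "structure is correct but cells are corrupted" 
condition in the description.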


