[jira] [Comment Edited] (HBASE-28456) HBase Restore restores old data if data for the same timestamp is in different hfiles

Bryan Beaudreault (Jira) Mon, 25 Mar 2024 11:48:05 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-28456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830637#comment-17830637
 ]


Bryan Beaudreault edited comment on HBASE-28456 at 3/25/24 6:47 PM:
--------------------------------------------------------------------

Ok I figured this out – there are 2 things going on:
 # We need to enable "hbase.mapreduce.hfileoutputformat.extendedcell.enabled" 
in the backup/restore jobs. This ensures that the sequenceId gets (de)serialized 
when it exists on the output KeyValue. This fixes the test case when normal 
puts are used.
 # Further, cells in a bulkload do not have a sequenceId. When a bulkload file 
is committed, it has a {{_SeqId_<num>}} value appended to the filename, which is 
the memstore ts at the time of the commit. Within the RegionServer, when a 
bulkloaded file is opened for reading, the SeqId value is parsed from the 
filename and set onto the HStoreFile. When a scan comes in, the 
StoreFileScanner handles propagating that sequenceId onto the Cells returned. 
We need to add similar handling when a bulkloaded file is read through 
HFileInputFormat.
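
To make the second point concrete, here is a minimal, standalone sketch of recovering the sequence id from a bulkloaded file name. It is not the HBase implementation and not the eventual patch: the class name {{BulkLoadSeqId}}, the method {{parseSeqId}}, and the exact suffix shape (a {{_SeqId_<num>}} marker at the end of the name, optionally followed by a trailing underscore) are assumptions made for illustration only.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of HBase.
public class BulkLoadSeqId {
    // Assumed suffix shape: "_SeqId_<num>" at the end of the file name,
    // optionally followed by one trailing underscore.
    private static final Pattern SEQ_ID_SUFFIX =
        Pattern.compile("_SeqId_(\\d{1,18})_?$");

    // Returns the sequence id encoded in the file name, or empty when the
    // name carries no such suffix (for example, a normally flushed file).
    static OptionalLong parseSeqId(String fileName) {
        Matcher m = SEQ_ID_SUFFIX.matcher(fileName);
        if (!m.find()) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(Long.parseLong(m.group(1)));
    }

    public static void main(String[] args) {
        // Made-up file names, used only to exercise the parser.
        System.out.println(parseSeqId("0123abcd_SeqId_42_"));
        System.out.println(parseSeqId("0123abcd"));
    }
}
```

A reader that goes through HFileInputFormat would then need to stamp the parsed value onto every cell it emits from that file, so that for cells with the same row, column and timestamp the one from the later file wins, which is the role StoreFileScanner plays on the RegionServer side.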

With these changes, your test succeeds. I'm going to work on formulating them 
into a PR.


> HBase Restore restores old data if data for the same timestamp is in 
> different hfiles
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-28456
>                 URL: https://issues.apache.org/jira/browse/HBASE-28456
>             Project: HBase
>          Issue Type: Bug
>          Components: backup&restore
>    Affects Versions: 2.6.0, 3.0.0
>            Reporter: Ruben Van Wanzeele
>            Assignee: Bryan Beaudreault
>            Priority: Blocker
>         Attachments: 
> ChangesOnHFilesOnSameTimestampAreNotCorrectlyRestored.java
>
>
> The restore brings back 'old' data when executing restore.
> It feels like the hfile sequence id is not respected during the restore.
> See testing code attached. The workaround solution is to trigger major 
> compaction before doing the backup (not really feasible for daily backups).
> We didn't investigate this yet, but this might also impact the merge of 
> multiple incremental backups (since that follows a similar code path merging 
> hfiles).
> This currently blocks our support for HBase backup and restore.
> Willing to participate in a solution if necessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
