[ 
https://issues.apache.org/jira/browse/HBASE-28697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862475#comment-17862475
 ] 

Dieter De Paepe commented on HBASE-28697:
-----------------------------------------

Hi Ray,

I agree with your assessment that this is a potential cause of data loss. 
Those records should only be deleted after the backup has been completed. 
(While reviewing that code, I noticed that the deletion of those records also 
doesn't seem to account for the possibility of multiple backup roots; I 
logged HBASE-28706 for that.)

After browsing some of the code related to how the WALs are rolled and 
backed up, I think it should be possible to retry an actual incremental backup 
after a failure.
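
To illustrate why the ordering matters, here is a minimal sketch that uses an 
in-memory set as a stand-in for the bulk load rows in the system table. The 
class and method names are hypothetical and do not correspond to the actual 
HBase code; the sketch only models "delete before completion" versus "delete 
after completion" with one failed attempt followed by a retry.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class BulkLoadCleanupOrder {

    /**
     * One simulated incremental backup attempt (hypothetical model, not HBase code).
     * Returns the bulk loaded files captured by this attempt.
     */
    static Set<String> attempt(Set<String> bulkLoadRows,
                               boolean deleteBeforeComplete,
                               boolean failBeforeComplete) {
        // Step 2: copy every bulk loaded file the system table knows about.
        Set<String> copied = new TreeSet<>(bulkLoadRows);

        // Step 3 in the order described in the issue: rows are deleted early.
        if (deleteBeforeComplete) {
            bulkLoadRows.clear();
        }

        // Simulated failure before the backup can be marked as complete.
        if (failBeforeComplete) {
            throw new IllegalStateException("simulated failure before completion");
        }

        // Step 4 happens here: the backup is considered complete.
        // Reordered variant: rows are deleted only after completion.
        if (!deleteBeforeComplete) {
            bulkLoadRows.clear();
        }
        return copied;
    }

    /** Runs one failing attempt, then a retry, and returns what the retry captured. */
    static Set<String> backupWithOneRetry(boolean deleteBeforeComplete) {
        Set<String> rows = new TreeSet<>(Arrays.asList("hfile-1", "hfile-2"));
        try {
            attempt(rows, deleteBeforeComplete, true);
        } catch (IllegalStateException e) {
            // The first attempt failed; its output is discarded and we retry.
        }
        return attempt(rows, deleteBeforeComplete, false);
    }

    public static void main(String[] args) {
        System.out.println("delete early: " + backupWithOneRetry(true));
        System.out.println("delete late:  " + backupWithOneRetry(false));
    }
}
```

In this model, the retry only sees the bulk loaded files when the rows are 
deleted after the backup is marked complete; with early deletion the retry 
"succeeds" with nothing to copy.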

> Incremental backups delete bulk loaded system table rows too early
> ------------------------------------------------------------------
>
>                 Key: HBASE-28697
>                 URL: https://issues.apache.org/jira/browse/HBASE-28697
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Ray Mattingly
>            Priority: Major
>
> I've been thinking through the incremental backup order of operations, and I 
> think we delete rows from the bulk loads system table too early and, 
> consequently, make it possible to produce a "successful" incremental backup 
> that is missing bulk loads.
> To summarize the steps here, starting in 
> {{IncrementalTableBackupClient#execute}}:
>  # We take an incremental backup of the WALs generated since the last backup
>  # We ensure any bulk loads done since the last backup are appropriately 
> represented in the new backup by going through the system table and copying 
> the appropriate files to the backup directory
>  # We delete all of the system table rows which told us about these bulk loads
>  # We generate a backup manifest and mark the backup as complete
> If we begin deleting any of the system table rows regarding bulk loads, but 
> fail in step 3 or 4 before we are able to mark the backup as complete, then 
> we'll be in a precarious spot. If we retry an incremental backup then it may 
> succeed, but it would not know to persist the bulk loaded files for which we 
> have already deleted system table references.
> We could consider this issue an extension or replacement of 
> https://issues.apache.org/jira/browse/HBASE-28084 in some ways, depending on 
> what solution we land on. I think that we could fix this specific issue by 
> reordering the bulk load table cleanup, but there will always be gotchas like 
> this. Maybe it is simpler to require that the next backup be a full backup 
> after any incremental failure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
