[ https://issues.apache.org/jira/browse/HBASE-28697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862475#comment-17862475 ]
Dieter De Paepe commented on HBASE-28697:
-----------------------------------------
Hi Ray,
I agree with your assessment that this is a potential cause for data loss.
Those records should only be deleted after the backup has been completed.
(While reviewing that code, I noticed that the deletion of those records also
doesn't seem to take into account the possibility of multiple backup roots. I
logged HBASE-28706 for that.)
After browsing some of the code related to how the WALs are rolled and
backed up, I think it should be possible to retry an actual incremental backup
after a failure.
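A minimal Java sketch of that ordering, under stated assumptions: this is not
HBase code, and every name in it (BackupOrderSketch, recordBulkLoad,
runIncrementalBackup, and so on) is hypothetical. An in-memory list stands in
for the bulk load rows of the backup system table. The point is only that the
rows are deleted after the backup has been marked complete, so a failed run
leaves them in place for a retry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names are real HBase classes or methods.
public class BackupOrderSketch {
    // Stand-in for the bulk load rows kept in the backup system table.
    private final List<String> bulkLoadRows = new ArrayList<>();
    private boolean lastBackupComplete = false;

    public void recordBulkLoad(String hfile) {
        bulkLoadRows.add(hfile);
    }

    public List<String> pendingBulkLoads() {
        return new ArrayList<>(bulkLoadRows);
    }

    public boolean isLastBackupComplete() {
        return lastBackupComplete;
    }

    // Runs one simulated incremental backup and returns the files it captured.
    public List<String> runIncrementalBackup(boolean failBeforeCompletion) {
        lastBackupComplete = false;
        // Back up the WALs (elided) and copy the bulk loaded files.
        List<String> copied = new ArrayList<>(bulkLoadRows);
        // Simulated failure at any point before the backup is marked complete.
        if (failBeforeCompletion) {
            throw new IllegalStateException("simulated failure before completion");
        }
        // Write the manifest (elided) and mark the backup as complete.
        lastBackupComplete = true;
        // Only now delete the rows that this backup has captured.
        bulkLoadRows.removeAll(copied);
        return copied;
    }

    public static void main(String[] args) {
        BackupOrderSketch sketch = new BackupOrderSketch();
        sketch.recordBulkLoad("hfile-1");
        try {
            sketch.runIncrementalBackup(true);
        } catch (IllegalStateException e) {
            System.out.println("Pending after failure: " + sketch.pendingBulkLoads());
        }
        System.out.println("Captured by retry: " + sketch.runIncrementalBackup(false));
        System.out.println("Pending after retry: " + sketch.pendingBulkLoads());
    }
}
```

In this sketch a retry after the simulated failure still sees the bulk load
row, because nothing was deleted before completion. A real fix would also need
to handle a failure between marking the backup complete and deleting the rows,
which the sketch does not model.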
> Incremental backups delete bulk loaded system table rows too early
> ------------------------------------------------------------------
>
> Key: HBASE-28697
> URL: https://issues.apache.org/jira/browse/HBASE-28697
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Ray Mattingly
> Priority: Major
>
> I've been thinking through the incremental backup order of operations, and I
> think we delete rows from the bulk loads system table too early and,
> consequently, make it possible to produce a "successful" incremental backup
> that is missing bulk loads.
> To summarize the steps here, starting in
> {{IncrementalTableBackupClient#execute}}:
> # We take an incremental backup of the WALs generated since the last backup
> # We ensure any bulk loads done since the last backup are appropriately
> represented in the new backup by going through the system table and copying
> the appropriate files to the backup directory
> # We delete all of the system table rows which told us about these bulk loads
> # We generate a backup manifest and mark the backup as complete
> If we begin deleting any of the system table rows regarding bulk loads, but
> fail in step 3 or 4 before we are able to mark the backup as complete, then
> we'll be in a precarious spot. If we retry an incremental backup then it may
> succeed, but it would not know to persist the bulk loaded files for which we
> have already deleted system table references.
> We could consider this issue an extension or replacement of
> https://issues.apache.org/jira/browse/HBASE-28084 in some ways, depending on
> what solution we land on. I think that we could fix this specific issue by
> reordering the bulk load table cleanup, but there will always be gotchas like
> this. Maybe it is simpler to require that the next backup be a full backup
> after any incremental failure.