[
https://issues.apache.org/jira/browse/HBASE-29800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057305#comment-18057305
]
Dieter De Paepe commented on HBASE-29800:
-----------------------------------------
In my opinion, the TS log bound map in the BackupInfo should be further
refactored (in future work) to contain WAL info only for the backup itself. It
could then be used to derive the "oldest WAL file included in the backups". In
that regard, there is no point in keeping the StartCode around.
If we took a different path, the problem would remain that the StartCode is
updated independently of the BackupInfo. That could be done in a way that is
correct, but it would be complex to reason about.
I've created a PR that fixes the bug as described above and also removes the
StartCode. This simplifies a lot of things.
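
To illustrate the refactoring idea, here is a minimal sketch of deriving the
oldest WAL timestamp per region server once each backup records only its own
WAL info. It assumes a simplified model: the class WalBoundarySketch, the
method oldestWalPerServer and the plain "server name -> timestamp" maps are
hypothetical stand-ins, not the actual BackupInfo API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the HBase code base.
final class WalBoundarySketch {

  /**
   * Each input map stands for one backup and maps a region server name to the
   * timestamp of the oldest WAL that this backup itself covers (assumed model).
   * The result holds, per server, the smallest such timestamp over all backups.
   */
  static Map<String, Long> oldestWalPerServer(List<Map<String, Long>> perBackupWalTimestamps) {
    Map<String, Long> oldest = new HashMap<>();
    for (Map<String, Long> backup : perBackupWalTimestamps) {
      for (Map.Entry<String, Long> entry : backup.entrySet()) {
        // Keep the minimum timestamp seen so far for this server.
        oldest.merge(entry.getKey(), entry.getValue(), Long::min);
      }
    }
    return oldest;
  }
}
```

With such a per-backup map, no separately maintained StartCode would be needed
to find this boundary, since it follows from the backup records alone.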
> WAL logs are unprotected during first full backup
> -------------------------------------------------
>
> Key: HBASE-29800
> URL: https://issues.apache.org/jira/browse/HBASE-29800
> Project: HBase
> Issue Type: Bug
> Components: backup&restore
> Reporter: Dieter De Paepe
> Priority: Major
> Labels: pull-request-available
>
> There is a small window during the creation of the first full backup in the
> first/only backup root where WAL logs might be eligible for deletion, which
> could lead to data loss in subsequent incremental backups.
> Pseudo code for this scenario is as follows (see
> FullTableBackupClient#execute):
> {code:java}
> // This is our first backup. Let's put some marker to system table so that we
> // can hold the logs while we do the backup.
> backupManager.writeBackupStartCode(0L);
> // Roll the WALs
> BackupUtils.logRoll(...);
> snapshotAndCopyTables();
> backupManager.writeBackupStartCode(newStartCode);
> // Register the backupInfo as completed
> completeBackup(...);{code}
> The comment on the "0" backupStartCode suggests that it prevents WAL deletion
> until the backup is completed, but this is not the case.
> The component responsible for preventing WAL deletion for backups is
> BackupLogCleaner. While the log cleaner does read and use the backup start
> codes, it only does so for backups that are already completed:
> {code:java}
> // true means only include completed backups
> List<BackupInfo> backups = sysTable.getBackupHistory(true); {code}
> So the log cleaner will not even be aware of the backup root.
> I believe this means there is a risk of data loss in the following
> incremental backup when a table, after it has been snapshotted but before the
> backup is completed, performs a log roll and the log cleaner activates.
> The simplest fix is probably to have the log cleaner also use in-progress
> backupInfos to calculate the startCode.
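
The suggested fix can be sketched roughly as follows. This is a simplified
model under stated assumptions, not the BackupLogCleaner implementation: the
class LogCleanerSketch, its State enum, the Backup class and the
walsNeededFrom field are all hypothetical, and each Backup is assumed to
represent the latest backup of one backup root.

```java
import java.util.List;
import java.util.OptionalLong;

// Hypothetical illustration only; not part of the HBase code base.
final class LogCleanerSketch {

  enum State { RUNNING, COMPLETE }

  /** Assumed model: the latest backup of a single backup root. */
  static final class Backup {
    final State state;
    final long walsNeededFrom; // WALs at or after this timestamp may still be needed

    Backup(State state, long walsNeededFrom) {
      this.state = state;
      this.walsNeededFrom = walsNeededFrom;
    }
  }

  /** Oldest timestamp that must be protected, or empty if no backup is considered. */
  static OptionalLong protectionBoundary(List<Backup> backups, boolean includeInProgress) {
    return backups.stream()
        .filter(b -> includeInProgress || b.state == State.COMPLETE)
        .mapToLong(b -> b.walsNeededFrom)
        .min();
  }

  /** A WAL is deletable if no boundary exists or the WAL is older than the boundary. */
  static boolean canDelete(long walTimestamp, OptionalLong boundary) {
    return !boundary.isPresent() || walTimestamp < boundary.getAsLong();
  }
}
```

In this model, while the first full backup is still RUNNING, a cleaner that
only considers completed backups finds no boundary and treats every WAL as
deletable. Passing includeInProgress = true makes the running backup
contribute a boundary, so WALs rolled after the snapshot stay protected.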
--
This message was sent by Atlassian Jira
(v8.20.10#820010)