DieterDP-ng commented on code in PR #7216:
URL: https://github.com/apache/hbase/pull/7216#discussion_r2294306835
########## src/main/asciidoc/_chapters/backup_restore.adoc: ##########
@@ -804,16 +804,18 @@ providing a comparable level of security. This is a manual step which users *mus
 [[br.technical.details]]
 == Technical Details of Incremental Backup and Restore
 
-HBase incremental backups enable more efficient capture of HBase table images than previous attempts at serial backup and restore
-solutions, such as those that only used HBase Export and Import APIs. Incremental backups use Write Ahead Logs (WALs) to capture
-the data changes since the previous backup was created. A WAL roll (create new WALs) is executed across all RegionServers to track
-the WALs that need to be in the backup.
+HBase incremental backups enable more efficient capture of HBase table images than previous attempts at serial backup and restore solutions, such as those that only used HBase Export and Import APIs.
+Incremental backups use Write Ahead Logs (WALs) to capture the data changes since the previous backup was created. A WAL roll (create new WALs) is executed across all RegionServers to track the WALs that need to be in the backup.
+In addition to WALs, incremental backups also track bulk-loaded HFiles for tables under backup.
 
-After the incremental backup image is created, the source backup files usually are on same node as the data source. A process similar
-to the DistCp (distributed copy) tool is used to move the source backup files to the target file systems. When a table restore operation
-starts, a two-step process is initiated. First, the full backup is restored from the full backup image. Second, all WAL files from
-incremental backups between the last full backup and the incremental backup being restored are converted to HFiles, which the HBase
-Bulk Load utility automatically imports as restored data in the table.
+Incremental backup gathers all WAL files generated since the last backup from the source cluster,
+converts them to HFiles in a `.tmp` directory under the `BACKUP_ROOT`, and then moves these HFiles to their final location under the backup root directory to form the backup image.
+It also reads bulk load records from the backup system table, forms the paths for the corresponding bulk-loaded HFiles, and copies those files to the backup destination.
+This ensures bulk-loaded files are preserved and not deleted by cleaner chores before the backup completes.

Review Comment:
   I'd rephrase this: `Bulk-loaded files are preserved (not deleted by cleaner chores) until they've been included in a backup (for each backup root).`

########## src/main/asciidoc/_chapters/backup_restore.adoc: ##########
@@ -872,8 +874,10 @@ data at the full 80MB/s and `-w` is used to limit the job from spawning 16 worke
 Like we did for full backups, we have to understand the incremental backup process to approximate its runtime and cost.
 
-* Identify new write-ahead logs since last full or incremental backup: negligible. Apriori knowledge from the backup system table(s).
+* Identify new write-ahead logs since the last full or incremental backup: negligible. Apriori knowledge from the backup system table(s).
 * Read, filter, and write "minimized" HFiles equivalent to the WALs: dominated by the speed of writing data. Relative to write speed of HDFS.
+* Read bulk load records from the backup system table, form the paths for bulk-loaded HFiles, and copy them to the backup destination.
+These entries in the backup system table are not cleaned until the backup is marked complete to ensure the cleaner chore does not delete the files.

Review Comment:
   This line can be scrapped in my opinion, it's already mentioned above, and it makes less sense to mention this in a "performance" section.
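The performance bullets quoted in the hunk above reason that identifying new WALs is negligible, so an incremental backup's runtime is dominated by the throughput of converting WALs to HFiles and copying bulk-loaded HFiles. As a rough illustration of that reasoning only, here is a hedged back-of-the-envelope sketch; the function name, the default worker count, and the assumption that each worker sustains the 80 MB/s figure mentioned in the section are illustrative, not measured HBase behavior:

```python
def estimate_backup_minutes(wal_bytes, bulkload_bytes,
                            throughput_mb_s=80.0, workers=8):
    """Rough wall-clock estimate for an incremental backup.

    Assumes WAL identification is negligible (apriori knowledge from the
    backup system table) and that converting WALs to HFiles plus copying
    bulk-loaded HFiles together sustain workers * throughput_mb_s.
    All defaults are illustrative assumptions, not measured values.
    """
    total_mb = (wal_bytes + bulkload_bytes) / (1024 * 1024)
    seconds = total_mb / (throughput_mb_s * workers)
    return seconds / 60.0

# Example: 200 GiB of new WALs plus 50 GiB of bulk-loaded HFiles,
# 8 workers each sustaining ~80 MB/s -> 256000 MiB / 640 MB/s = 400 s.
print(round(estimate_backup_minutes(200 * 1024**3, 50 * 1024**3), 1))  # -> 6.7
```

In practice the copy stage can be throttled with the backup tool's worker option (the `-w` flag mentioned in the hunk's surrounding text), which trades runtime for reduced cluster load.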
########## src/main/asciidoc/_chapters/backup_restore.adoc: ##########
@@ -872,8 +874,10 @@ data at the full 80MB/s and `-w` is used to limit the job from spawning 16 worke
 Like we did for full backups, we have to understand the incremental backup process to approximate its runtime and cost.
 
-* Identify new write-ahead logs since last full or incremental backup: negligible. Apriori knowledge from the backup system table(s).
+* Identify new write-ahead logs since the last full or incremental backup: negligible. Apriori knowledge from the backup system table(s).
 * Read, filter, and write "minimized" HFiles equivalent to the WALs: dominated by the speed of writing data. Relative to write speed of HDFS.
+* Read bulk load records from the backup system table, form the paths for bulk-loaded HFiles, and copy them to the backup destination.

Review Comment:
   @hgromer - I think you can contribute a summary of HBASE-27659 to this part of the HBase docs.

########## src/main/asciidoc/_chapters/backup_restore.adoc: ##########
@@ -804,16 +804,18 @@ providing a comparable level of security. This is a manual step which users *mus
 [[br.technical.details]]
 == Technical Details of Incremental Backup and Restore
 
-HBase incremental backups enable more efficient capture of HBase table images than previous attempts at serial backup and restore
-solutions, such as those that only used HBase Export and Import APIs. Incremental backups use Write Ahead Logs (WALs) to capture
-the data changes since the previous backup was created. A WAL roll (create new WALs) is executed across all RegionServers to track
-the WALs that need to be in the backup.
+HBase incremental backups enable more efficient capture of HBase table images than previous attempts at serial backup and restore solutions, such as those that only used HBase Export and Import APIs.
+Incremental backups use Write Ahead Logs (WALs) to capture the data changes since the previous backup was created. A WAL roll (create new WALs) is executed across all RegionServers to track the WALs that need to be in the backup.
+In addition to WALs, incremental backups also track bulk-loaded HFiles for tables under backup.
 
-After the incremental backup image is created, the source backup files usually are on same node as the data source. A process similar
-to the DistCp (distributed copy) tool is used to move the source backup files to the target file systems. When a table restore operation
-starts, a two-step process is initiated. First, the full backup is restored from the full backup image. Second, all WAL files from
-incremental backups between the last full backup and the incremental backup being restored are converted to HFiles, which the HBase
-Bulk Load utility automatically imports as restored data in the table.
+Incremental backup gathers all WAL files generated since the last backup from the source cluster,
+converts them to HFiles in a `.tmp` directory under the `BACKUP_ROOT`, and then moves these HFiles to their final location under the backup root directory to form the backup image.
+It also reads bulk load records from the backup system table, forms the paths for the corresponding bulk-loaded HFiles, and copies those files to the backup destination.
+This ensures bulk-loaded files are preserved and not deleted by cleaner chores before the backup completes.
+A process similar to the DistCp (distributed copy) tool is used to move the backup files to the target file systems.

Review Comment:
   Nit: `file system`

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
