[
https://issues.apache.org/jira/browse/HBASE-27659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18011816#comment-18011816
]
Vinayak Hegde commented on HBASE-27659:
---------------------------------------
[~hgromer] Thanks for fixing the issue.
As part of https://issues.apache.org/jira/browse/HBASE-29484, we're updating
the backup and restore documentation to reflect the latest changes.
Since this Jira also touches the backup and restore components, could you
please review the changes and check if any updates to the documentation are
needed? If so, kindly create a sub-task under
https://issues.apache.org/jira/browse/HBASE-29484 detailing the required
documentation updates. You’re welcome to assign it to yourself and work on it,
or we’d be happy to take it up as well.
Thanks!
> Incremental backups should re-use splits from last full backup
> --------------------------------------------------------------
>
> Key: HBASE-27659
> URL: https://issues.apache.org/jira/browse/HBASE-27659
> Project: HBase
> Issue Type: Improvement
> Reporter: Bryan Beaudreault
> Assignee: Hernan Gelaf-Romer
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0-beta-2, 2.6.2
>
>
> All incremental backups require a previous full backup. Full backups use
> snapshots + ExportSnapshot, which includes exporting the SnapshotManifest.
> The SnapshotManifest includes all of the regions in the table during the
> snapshot.
> Incremental backups use WALPlayer to turn new HLogs since the last backup into
> HFiles. This uses HFileOutputFormat2, which writes HFiles along the split
> boundaries of the tables at the time that it runs.
> Active clusters may have regions split and merge over time, so the split
> boundaries of incremental backup hfiles may not align to the original full
> backup. This means we need to use MapReduceHFileSplitterJob during restore in
> order to read all of the hfiles for all of the incremental backups and
> re-split them based on the restored table.
> * So let's say a cluster with regions A, B, C does a full backup. Data in
> that backup will be segmented into those 3 regions.
> * Over time the cluster splits and merges and we end up with totally
> different regions D, E, F. An incremental backup occurs, and the data will be
> segmented into those 3 regions.
> * Later the cluster splits those 3 regions so we end up with new regions G,
> H, I, J, K, L. The next incremental backup is segmented into those 6 regions.
> When we go to restore this cluster, it'll pull the full backup and the 2
> incrementals. The full backup will get restored first, so the new table will
> have regions A, B, C. Then all of the hfiles from the incrementals will be
> combined together and run through MapReduceHFileSplitterJob. This will cause
> all of those data files to get re-partitioned based on the A, B, C regions of
> the newly restored table (based on the full backup).
> This splitting process is expensive on a large cluster. We could skip it
> entirely if incremental backups used the RegionInfos from the original full
> backup SnapshotManifest as the splits for WALPlayer. That way, all
> incremental backups would use the same splits as the original full backup. The
> resulting hfiles could be directly bulkloaded without any split process,
> reducing cost and time of restore.
> One other benefit is that one could use the combination of a full backup +
> all incremental backups as an input to their own mapreduce job. This is
> impossible now because all of the backups will have HFiles with different
> start/end keys which don't align to a common set of splits for combining into
> ClientSideRegionScanner.
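
The alignment problem described in the issue can be illustrated with a small, self-contained sketch. This is not HBase code: the class and method names (SplitAlignmentSketch, regionIndex, fitsInOneRegion) and the split keys "h" and "p" are hypothetical, and row keys are modeled as Strings rather than the byte[] keys HBase actually compares. It only shows why an HFile written along later region boundaries may straddle a boundary of the full backup (and so would need re-splitting), while an HFile written along the full backup's own boundaries would not.

```java
import java.util.Arrays;

// Illustrative sketch only; none of these names come from HBase.
class SplitAlignmentSketch {

    // Returns the index of the region that holds rowKey, given the sorted
    // split keys of a table. With n split keys there are n + 1 regions;
    // region i covers [splits[i-1], splits[i]).
    static int regionIndex(String[] sortedSplitKeys, String rowKey) {
        int pos = Arrays.binarySearch(sortedSplitKeys, rowKey);
        // Exact match: rowKey is the start key of region pos + 1.
        // No match: binarySearch returns -(insertionPoint) - 1, and the
        // insertion point is the index of the containing region.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // True when an HFile whose keys span [firstKey, lastKey] (inclusive)
    // falls entirely inside one region of the given split layout.
    static boolean fitsInOneRegion(String[] sortedSplitKeys,
                                   String firstKey, String lastKey) {
        return regionIndex(sortedSplitKeys, firstKey)
            == regionIndex(sortedSplitKeys, lastKey);
    }

    public static void main(String[] args) {
        // Assumed split keys of the full backup: regions A = [, h),
        // B = [h, p), C = [p, ).
        String[] fullBackupSplits = {"h", "p"};

        // An HFile written along a later, different boundary that covers
        // keys "a".."j" crosses the full backup's "h" boundary.
        System.out.println(fitsInOneRegion(fullBackupSplits, "a", "j")); // false

        // An HFile written along the full backup's own boundaries stays
        // inside region B.
        System.out.println(fitsInOneRegion(fullBackupSplits, "h", "o")); // true
    }
}
```

In this toy model, a file for which fitsInOneRegion returns false is one that would have to be re-partitioned during restore, whereas files that always return true correspond to the case the issue proposes, where the incremental output reuses the full backup's splits.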
--
This message was sent by Atlassian Jira
(v8.20.10#820010)