[ 
https://issues.apache.org/jira/browse/HBASE-27659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Dimiduk resolved HBASE-27659.
----------------------------------
    Resolution: Fixed

Pushed to branch-2.6+. Thanks a lot for the contribution [~hgromer]

> Incremental backups should re-use splits from last full backup
> --------------------------------------------------------------
>
>                 Key: HBASE-27659
>                 URL: https://issues.apache.org/jira/browse/HBASE-27659
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Bryan Beaudreault
>            Assignee: Hernan Gelaf-Romer
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0-beta-2, 2.6.2
>
>
> All incremental backups require a previous full backup. Full backups use 
> snapshots + ExportSnapshot, which includes exporting the SnapshotManifest. 
> The SnapshotManifest includes all of the regions in the table during the 
> snapshot.
> Incremental backups use WALPlayer to turn new HLogs since the last backup 
> into HFiles. This uses HFileOutputFormat2, which writes HFiles along the 
> split boundaries of the tables at the time that it runs.
> Active clusters may have regions split and merge over time, so the split 
> boundaries of incremental backup hfiles may not align to the original full 
> backup. This means we need to use MapReduceHFileSplitterJob during restore in 
> order to read all of the hfiles for all of the incremental backups and 
> re-split them based on the restored table.
>  * So let's say a cluster with regions A, B, C does a full backup. Data in 
> that backup will be segmented into those 3 regions.
>  * Over time the cluster splits and merges and we end up with totally 
> different regions D, E, F. An incremental backup occurs, and the data will be 
> segmented into those 3 regions.
>  * Later the cluster splits those 3 regions so we end up with new regions 
> G, H, I, J, K, L. The next incremental backup will be segmented into those 
> 6 regions.
> When we go to restore this cluster, it'll pull the full backup and the 2 
> incrementals. The full backup will get restored first, so the new table will 
> have regions A, B, C.  Then all of the hfiles from the incrementals will be 
> combined together and run through MapReduceHFileSplitterJob. This will cause 
> all of those data files to get re-partitioned based on the A, B, C regions of 
> the newly restored table (based on the full backup).
> This splitting process is expensive on a large cluster. We could skip it 
> entirely if incremental backups used the RegionInfos from the original full 
> backup SnapshotManifest as the splits for WALPlayer. All incremental 
> backups would then use the same splits as the original full backup, and the 
> resulting hfiles could be bulkloaded directly without any split process, 
> reducing the cost and time of restore.
> One other benefit is that one could use the combination of a full backup + 
> all incremental backups as an input to their own mapreduce job. This is 
> impossible now because all of the backups will have HFiles with different 
> start/end keys which don't align to a common set of splits for combining into 
> ClientSideRegionScanner.
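
The alignment argument in the description can be illustrated with a small toy model. This is only a sketch in plain Python, not HBase code: it does not call WALPlayer, HFileOutputFormat2, or MapReduceHFileSplitterJob, and the function names, split keys, and row keys below are all made up for illustration. It assumes regions are described by their sorted split keys, with each split key being the inclusive start of the next region.

```python
from bisect import bisect_right

def region_index(split_keys, row_key):
    """Index of the region that holds row_key.

    split_keys are the sorted start keys of every region after the first;
    a row equal to a split key belongs to the region that starts there.
    """
    return bisect_right(split_keys, row_key)

def partition(rows, split_keys):
    """Group row keys into one bucket per region (a stand-in for one HFile per region)."""
    buckets = [[] for _ in range(len(split_keys) + 1)]
    for row in sorted(rows):
        buckets[region_index(split_keys, row)].append(row)
    return buckets

def needs_resplit(file_rows, target_split_keys):
    """True if the rows of one file fall into more than one target region."""
    return len({region_index(target_split_keys, r) for r in file_rows}) > 1

# Hypothetical boundaries: regions A, B, C at the time of the full backup.
full_backup_splits = [b"h", b"p"]
# Hypothetical boundaries: regions D, E, F at the time of the incremental.
live_splits = [b"d", b"t"]

# Hypothetical row keys written since the last backup.
new_rows = [b"apple", b"fig", b"kiwi", b"quince", b"yam"]

# Current behaviour: files follow the live table's splits.
current = partition(new_rows, live_splits)
# Proposed behaviour: files follow the full backup's splits.
proposed = partition(new_rows, full_backup_splits)

# Files that would have to be re-split before loading into a table restored as A, B, C.
misaligned = [f for f in current if f and needs_resplit(f, full_backup_splits)]
aligned = [f for f in proposed if f and needs_resplit(f, full_backup_splits)]

print(misaligned)  # [[b'fig', b'kiwi', b'quince']]
print(aligned)     # []
```

In this model, the file written for region E straddles all of A, B, and C, so it has to be re-split at restore time, while files written along the full backup's splits each fit inside a single restored region.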



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
