[ https://issues.apache.org/jira/browse/HBASE-27659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Dimiduk resolved HBASE-27659.
----------------------------------
    Resolution: Fixed

Pushed to branch-2.6+. Thanks a lot for the contribution [~hgromer]

> Incremental backups should re-use splits from last full backup
> --------------------------------------------------------------
>
>                 Key: HBASE-27659
>                 URL: https://issues.apache.org/jira/browse/HBASE-27659
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Bryan Beaudreault
>            Assignee: Hernan Gelaf-Romer
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0-beta-2, 2.6.2
>
>
> All incremental backups require a previous full backup. Full backups use
> snapshots + ExportSnapshot, which includes exporting the SnapshotManifest.
> The SnapshotManifest lists all of the regions in the table at the time of
> the snapshot.
>
> Incremental backups use WALPlayer to turn the HLogs written since the last
> backup into HFiles. This uses HFileOutputFormat2, which writes HFiles along
> the split boundaries the table has at the time the job runs.
>
> Active clusters may have regions split and merge over time, so the split
> boundaries of incremental backup HFiles may not align with those of the
> original full backup. This means we need to use MapReduceHFileSplitterJob
> during restore in order to read all of the HFiles for all of the
> incremental backups and re-split them based on the restored table.
>
> * Say a cluster with regions A, B, C does a full backup. Data in that
>   backup will be segmented into those 3 regions.
> * Over time the cluster splits and merges, and we end up with totally
>   different regions D, E, F. An incremental backup occurs, and its data
>   will be segmented into those 3 regions.
> * Later the cluster splits those 3 regions, so we end up with new regions
>   G, H, I, J, K, L. The next incremental backup is segmented into those 6
>   regions.
>
> When we go to restore this cluster, it will pull the full backup and the 2
> incrementals. The full backup gets restored first, so the new table will
> have regions A, B, C. Then all of the HFiles from the incrementals will be
> combined and run through MapReduceHFileSplitterJob. This causes all of
> those data files to be re-partitioned based on the A, B, C regions of the
> newly restored table (which come from the full backup).
>
> This splitting process is expensive on a large cluster. We could skip it
> entirely if incremental backups used the RegionInfos from the original full
> backup's SnapshotManifest as the splits for WALPlayer. All incremental
> backups would then use the same splits as the original full backup. The
> resulting HFiles could be bulkloaded directly without any split process,
> reducing the cost and time of a restore.
>
> One other benefit is that one could use the combination of a full backup +
> all incremental backups as an input to their own MapReduce job. This is
> impossible now because the backups have HFiles with different start/end
> keys, which don't align to a common set of splits for combining into
> ClientSideRegionScanner.
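
To make the alignment argument in the description concrete, here is a minimal toy model in Python. It is not HBase code and uses no HBase API: `partition`, `needs_resplit`, the split keys, and the row keys are all hypothetical names and values chosen for illustration. It only assumes that a region covers the half-open key range [start, end), and it treats each "HFile" as the sorted list of row keys written for one region.

```python
from bisect import bisect_right
from collections import defaultdict


def partition(rows, split_keys):
    """Group rows into one toy 'HFile' per region, given sorted split keys.

    A row equal to a split key belongs to the region that starts at that key,
    mirroring the [start, end) convention for region boundaries.
    """
    files = defaultdict(list)
    for row in sorted(rows):
        files[bisect_right(split_keys, row)].append(row)
    return [files[i] for i in sorted(files)]


def needs_resplit(hfile, target_splits):
    """True if the file's first and last rows land in different target regions."""
    first = bisect_right(target_splits, hfile[0])
    last = bisect_right(target_splits, hfile[-1])
    return first != last


# Hypothetical boundaries: the full backup saw three regions (A, B, C).
full_backup_splits = ["h", "q"]
# Hypothetical boundaries of the live table when the incremental backup runs.
live_splits = ["d", "m", "u"]

# Rows written since the full backup.
new_rows = ["b", "f", "k", "p", "s", "x"]

# Current behaviour: incremental files follow the live table's boundaries.
files_live = partition(new_rows, live_splits)
# Proposed behaviour: incremental files follow the full backup's boundaries.
files_aligned = partition(new_rows, full_backup_splits)

resplit_today = any(needs_resplit(f, full_backup_splits) for f in files_live)
resplit_proposed = any(needs_resplit(f, full_backup_splits) for f in files_aligned)
```

In this sketch, a file written along the live boundaries straddles a boundary of the restored table, so `resplit_today` is true, while every file written along the full backup's boundaries fits inside a single restored region, so `resplit_proposed` is false. That is the property that would let a restore bulkload the incremental files without the splitter job.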