Bryan Beaudreault created HBASE-27659:
-----------------------------------------

             Summary: Incremental backups should re-use splits from last full 
backup
                 Key: HBASE-27659
                 URL: https://issues.apache.org/jira/browse/HBASE-27659
             Project: HBase
          Issue Type: Improvement
            Reporter: Bryan Beaudreault


All incremental backups require a previous full backup. Full backups use 
snapshots + ExportSnapshot, which includes exporting the SnapshotManifest. The 
SnapshotManifest includes all of the regions in the table during the snapshot.

Incremental backups use WALPlayer to turn new HLogs since last backup into 
HFiles. This uses HFileOutputFormat2, which writes HFiles along the split 
boundaries of the tables at the time that it runs.

Active clusters may have regions split and merge over time, so the split 
boundaries of incremental backup hfiles may not align to the original full 
backup. This means we need to use MapReduceHFileSplitterJob during restore in 
order to read all of the hfiles for all of the incremental backups and re-split 
them based on the restored table.
 * So let's say a cluster with regions A, B, C does a full backup. Data in that 
backup will be segmented into those 3 regions.
 * Over time the cluster splits and merges and we end up with totally different 
regions D, E, F. An incremental backup occurs, and the data will be segmented 
into those 3 regions.
 * Later the cluster splits those 3 regions so we end up with new regions G, H, 
I, J, K, L. The next incremental backup will be segmented into those 6 regions.

When we go to restore this cluster, it'll pull the full backup and the 2 
incrementals. The full backup will get restored first, so the new table will 
have regions A, B, C.  Then all of the hfiles from the incrementals will be 
combined together and run through MapReduceHFileSplitterJob. This will cause 
all of those data files to get re-partitioned based on the A, B, C regions of 
the newly restored table (based on the full backup).
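
To make the misalignment concrete, here is a toy model in plain Python (not HBase code). It assumes regions are described only by their sorted start keys, with each region covering [startKey, nextStartKey). The function names and the sample keys are hypothetical, chosen for illustration.

```python
from bisect import bisect_right


def region_index(row_key: bytes, start_keys: list[bytes]) -> int:
    """Return the index of the region that contains row_key.

    start_keys must be sorted and begin with b"" (the first region's
    empty start key). A row equal to a start key belongs to that region.
    """
    return bisect_right(start_keys, row_key) - 1


def regions_spanned(first_key: bytes, last_key: bytes,
                    start_keys: list[bytes]) -> list[int]:
    """Return the indices of every region a file's key range overlaps."""
    lo = region_index(first_key, start_keys)
    hi = region_index(last_key, start_keys)
    return list(range(lo, hi + 1))


# Hypothetical boundaries at full-backup time (regions A, B, C).
full_backup_starts = [b"", b"h", b"p"]

# Hypothetical hfile written for region E (rows from "d" up to "m")
# by an incremental backup taken after the table was re-split.
spanned = regions_spanned(b"d001", b"lzzz", full_backup_starts)
print(spanned)  # the file overlaps more than one restored region
```

A file that overlaps more than one region of the restored table cannot be loaded into a single region as-is, which is why the restore has to re-split it.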

This splitting process is expensive on a large cluster. We could skip it 
entirely if incremental backups used the RegionInfos from the original full 
backup SnapshotManifest as the splits for WALPlayer. That way, all incremental 
backups would use the same splits as the original full backup. The resulting 
hfiles could be directly bulkloaded without any split process, reducing cost 
and time of restore.
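
As a rough sketch of the idea, again in plain Python rather than HBase code: if the rows replayed from the WALs are bucketed by the full backup's start keys instead of the table's current ones, each output file falls inside exactly one region of the restored table. The function name and sample keys are hypothetical, and real output would be HFiles rather than Python lists.

```python
from bisect import bisect_right
from collections import defaultdict


def partition_by_splits(rows: list[bytes],
                        start_keys: list[bytes]) -> dict[int, list[bytes]]:
    """Group row keys by the region (index into start_keys) that owns them.

    start_keys must be sorted and begin with b"". Rows in each bucket
    come out sorted, as they would need to be in an output file.
    """
    buckets: dict[int, list[bytes]] = defaultdict(list)
    for row in sorted(rows):
        buckets[bisect_right(start_keys, row) - 1].append(row)
    return dict(buckets)


# Hypothetical start keys taken from the full backup (regions A, B, C).
full_backup_starts = [b"", b"h", b"p"]

# Hypothetical row keys replayed from WALs for an incremental backup.
wal_rows = [b"q1", b"a1", b"k7", b"h0"]

for region, rows in partition_by_splits(wal_rows, full_backup_starts).items():
    print(region, rows)
```

Because every bucket maps to one region of the full backup, no file produced this way would need to be split again at restore time.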

One other benefit is that one could use the combination of a full backup + all 
incremental backups as an input to their own mapreduce job. This is impossible now 
because all of the backups will have HFiles with different start/end keys which 
don't align to a common set of splits for combining into 
ClientSideRegionScanner.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
