Bryan Beaudreault created HBASE-27659:
-----------------------------------------
Summary: Incremental backups should re-use splits from last full
backup
Key: HBASE-27659
URL: https://issues.apache.org/jira/browse/HBASE-27659
Project: HBase
Issue Type: Improvement
Reporter: Bryan Beaudreault
All incremental backups require a previous full backup. Full backups use
snapshots + ExportSnapshot, which includes exporting the SnapshotManifest. The
SnapshotManifest includes all of the regions in the table during the snapshot.
Incremental backups use WALPlayer to turn new HLogs since last backup into
HFiles. This uses HFileOutputFormat2, which writes HFiles along the split
boundaries of the tables at the time that it runs.
Active clusters may have regions split and merge over time, so the split
boundaries of incremental backup hfiles may not align to the original full
backup. This means we need to use MapReduceHFileSplitterJob during restore in
order to read all of the hfiles for all of the incremental backups and re-split
them based on the restored table.
* So let's say a cluster with regions A, B, C does a full backup. Data in that
backup will be segmented into those 3 regions.
* Over time the cluster splits and merges and we end up with totally different
regions D, E, F. An incremental backup occurs, and the data will be segmented
into those 3 regions.
* Later the cluster splits those 3 regions, so we end up with new regions G, H,
I, J, K, L. The next incremental backup is segmented into those 6 regions.
When we go to restore this cluster, it'll pull the full backup and the 2
incrementals. The full backup will get restored first, so the new table will
have regions A, B, C. Then all of the hfiles from the incrementals will be
combined together and run through MapReduceHFileSplitterJob. This will cause
all of those data files to get re-partitioned based on the A, B, C regions of
the newly restored table (based on the full backup).
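The scenario above can be modeled with a small sketch. This is not HBase code:
each HFile is reduced to a hypothetical (first_row, last_row) pair, a table is
reduced to its sorted list of split points, and the key values are made up. In
this simplified model, a file can be loaded as-is only if both of its boundary
rows fall into the same region of the restored table.

```python
from bisect import bisect_right


def region_index(row, split_points):
    """Index of the region [start, end) that contains row.

    split_points are the sorted start keys of every region except the first,
    so N split points describe N + 1 regions.
    """
    return bisect_right(split_points, row)


def files_needing_resplit(files, table_splits):
    """Return the files whose key range crosses a region boundary."""
    return [
        (first, last)
        for first, last in files
        if region_index(first, table_splits) != region_index(last, table_splits)
    ]


# Assumed split points of the restored table (regions A, B, C).
full_backup_splits = [b"g", b"p"]

# Assumed files from an incremental backup written along regions D, E, F.
incremental_files = [(b"a", b"c"), (b"d", b"j"), (b"k", b"z")]

misaligned = files_needing_resplit(incremental_files, full_backup_splits)
print(misaligned)
```

With these made-up keys, two of the three incremental files straddle a
boundary of the restored table, which is the situation that forces the
re-split during restore.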
This splitting process is expensive on a large cluster. We could skip it
entirely if incremental backups used the RegionInfos from the original full
backup SnapshotManifest as the splits for WALPlayer. That way, all incremental
backups would use the same splits as the original full backup. The resulting
hfiles could be directly bulkloaded without any split process, reducing cost
and time of restore.
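A minimal sketch of the proposed idea, as a simplified model rather than the
actual WALPlayer or HFileOutputFormat2 code: rows from new WAL edits are
bucketed by the split points recorded at full-backup time instead of the
table's current ones. All function names, keys, and split points below are
illustrative assumptions.

```python
from bisect import bisect_right


def partition_rows(rows, split_points):
    """Bucket sorted rows into len(split_points) + 1 regions."""
    buckets = [[] for _ in range(len(split_points) + 1)]
    for row in sorted(rows):
        buckets[bisect_right(split_points, row)].append(row)
    return buckets


def file_ranges(buckets):
    """Model each non-empty bucket as one output file: (first_row, last_row)."""
    return [(bucket[0], bucket[-1]) for bucket in buckets if bucket]


wal_rows = [b"b", b"f", b"h", b"n", b"q", b"x"]  # assumed rows in new WAL edits
full_backup_splits = [b"g", b"p"]  # assumed splits at full-backup time
current_splits = [b"d", b"k"]      # assumed splits after later splits/merges

# Proposed: partition along the full backup's splits.
aligned = file_ranges(partition_rows(wal_rows, full_backup_splits))
# Today: partition along whatever the table's splits are at backup time.
drifted = file_ranges(partition_rows(wal_rows, current_splits))

print(aligned)
print(drifted)
```

In this model, every file in `aligned` stays within one region of the full
backup, while `drifted` contains files that cross the `g` and `p` boundaries
and would have to be re-split on restore.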
One other benefit is that one could use the combination of a full backup + all
incremental backups as an input to their own mapreduce job. This is impossible
now because the backups have HFiles with different start/end keys, which don't
align to a common set of splits for combining into ClientSideRegionScanner.