At NGData, we use HBase backup as part of the backup procedure for our product. Besides HBase, some other components (HDFS, ZooKeeper, ...) are also backed up. Due to how our product works, there are dependencies between these components that dictate the backup order, e.g. HBase should be backed up first, then ZooKeeper, then... To minimize the time between the backups of the individual components (i.e. to minimize data drift), we designed a phased approach in our backup procedure:
* a "record" phase, where all data relevant for a backup is captured. E.g., for HDFS this is an HDFS snapshot.
* a "store" phase, where the captured data is moved to cloud storage. E.g., for HDFS this is a DistCp of that snapshot.

This approach allows us to defer any delay related to data transfer to the end of the backup procedure, meaning the time between data capture for all component backups is minimized.

The HBase backup API currently doesn't support this kind of phased approach, though the steps that are executed certainly would allow this:

* Record phase (full backup): roll WALs, snapshot tables
* Store phase (full backup): snapshot copy, bulk load copy, updating metadata, terminating backup session
* Record phase (incremental backup): roll WALs
* Store phase (incremental backup): convert WALs to HFiles, bulk load copy, HFile copy, metadata updates, terminating backup session

As this seems like a general use case, I would like to suggest refactoring the HBase backup API to allow this kind of two-phase approach. CLI usage can remain unchanged.

Before logging any ticket about this, I wanted to hear the community's thoughts. Unfortunately, I can't promise we will be available to actually spend time on this in the short term, but I'd rather have a plan of attack ready once we (or someone else) do have the time.

Regards,
Dieter
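
P.S. To make the proposed split concrete, here is a minimal Java sketch of how a two-phase procedure could be orchestrated across components. Everything in it is hypothetical: the `PhasedBackup` interface, its `record`/`store` methods, and the `component` helper are illustrative names only and are not part of the existing HBase backup API. The component stubs just log what they would do; a real implementation would take snapshots or roll WALs in `record` and copy data in `store`.

```java
import java.util.ArrayList;
import java.util.List;

public class PhasedBackupSketch {

    /** Hypothetical two-phase contract; NOT an existing HBase interface. */
    public interface PhasedBackup {
        String name();

        /** Cheap capture step, e.g. roll WALs and snapshot tables. */
        void record(List<String> log);

        /** Slow transfer step, e.g. copy the captured data to cloud storage. */
        void store(List<String> log);
    }

    /** Stub component that only logs which phase ran (assumption for illustration). */
    public static PhasedBackup component(String name) {
        return new PhasedBackup() {
            @Override public String name() { return name; }
            @Override public void record(List<String> log) { log.add("record:" + name); }
            @Override public void store(List<String> log) { log.add("store:" + name); }
        };
    }

    /**
     * Runs every record phase first (in dependency order), then every store
     * phase, so that data transfer never sits between two capture steps.
     */
    public static List<String> runBackup(List<PhasedBackup> componentsInDependencyOrder) {
        List<String> log = new ArrayList<>();
        for (PhasedBackup c : componentsInDependencyOrder) {
            c.record(log);
        }
        for (PhasedBackup c : componentsInDependencyOrder) {
            c.store(log);
        }
        return log;
    }

    public static void main(String[] args) {
        // HBase first, then ZooKeeper, as described above.
        List<String> log = runBackup(List.of(component("HBase"), component("ZooKeeper")));
        System.out.println(log);
        // prints [record:HBase, record:ZooKeeper, store:HBase, store:ZooKeeper]
    }
}
```

The point of the sketch is only the ordering: all captures complete before the first transfer starts, which is what keeps data drift between components small.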