At NGData, we are using HBase backup as part of the backup procedure for our 
product. Besides HBase, some other components (HDFS, ZooKeeper, ...) are also 
backed up.
Due to how our product works, there are some dependencies between these 
components, i.e. HBase should be backed up first, then ZooKeeper, then...
To minimize the time between the backup for each component (i.e. to minimize 
data drift), we designed a phased approach in our backup procedure:

  * a "record" phase, where all data relevant for a backup is captured. E.g., for
    HDFS this is an HDFS snapshot.
  * a "store" phase, where the captured data is moved to cloud storage. E.g., for
    HDFS this is a DistCp of that snapshot.

This approach allows us to push any delay related to data transfer to the end 
of the backup procedure, meaning the time between data capture for all 
component backups is minimized.
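
To make the ordering concrete, here is a minimal Java sketch of such a phased procedure. The BackupComponent interface, its method names and the component list are hypothetical and only illustrate the idea; they are not part of our product or of any existing API. All record steps run first, in dependency order, and only then do the slow store steps run.

```java
import java.util.ArrayList;
import java.util.List;

public class PhasedBackup {
    /** Hypothetical per-component contract, for illustration only. */
    interface BackupComponent {
        String name();
        void record(List<String> log); // fast: capture state (e.g. take a snapshot)
        void store(List<String> log);  // slow: move the captured data to remote storage
    }

    /** Stand-in component that only logs which phase ran. */
    static BackupComponent component(String name) {
        return new BackupComponent() {
            public String name() { return name; }
            public void record(List<String> log) { log.add("record " + name); }
            public void store(List<String> log) { log.add("store " + name); }
        };
    }

    /** Runs every record phase in the given order, then every store phase. */
    static List<String> run(List<BackupComponent> ordered) {
        List<String> log = new ArrayList<>();
        for (BackupComponent c : ordered) {
            c.record(log);
        }
        for (BackupComponent c : ordered) {
            c.store(log);
        }
        return log;
    }

    public static void main(String[] args) {
        // Illustrative order only: HBase first, then ZooKeeper, as described above.
        List<BackupComponent> ordered =
                List.of(component("HBase"), component("ZooKeeper"));
        run(ordered).forEach(System.out::println);
    }
}
```

With this structure, a slow transfer for one component never delays the data capture of the next one.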

The HBase backup API currently doesn't support this kind of phased approach, 
though the steps that are executed certainly would allow this:

  * Record phase (full backup): roll WALs, snapshot tables
  * Store phase (full backup): snapshot copy, bulk load copy, updating metadata,
    terminating backup session
  * Record phase (incremental backup): roll WALs
  * Store phase (incremental backup): convert WALs to HFiles, bulk load copy,
    HFile copy, metadata updates, terminating backup session

As this seems like a general use-case, I would like to suggest refactoring the 
HBase backup API to allow this kind of 2-phase approach. CLI usage can remain 
unchanged.
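
As a rough illustration of what I have in mind, below is a sketch of a possible shape for such an API. None of these types exist in HBase today: TwoPhaseBackup, RecordedBackup, BackupType and LoggingBackup are names I made up, and the implementation only logs the step names listed above, grouped into a record part and a store part. The default backup() method chains both phases, which is how the existing single-call (and CLI) behaviour could be kept.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPhaseBackupSketch {
    enum BackupType { FULL, INCREMENTAL }

    /** Hypothetical handle linking a record phase to its later store phase. */
    static final class RecordedBackup {
        final String backupId;
        final BackupType type;

        RecordedBackup(String backupId, BackupType type) {
            this.backupId = backupId;
            this.type = type;
        }
    }

    /** Hypothetical two-phase contract; NOT the existing HBase backup API. */
    interface TwoPhaseBackup {
        RecordedBackup record(BackupType type);

        void store(RecordedBackup recorded);

        /** Single-call variant: record immediately followed by store. */
        default void backup(BackupType type) {
            store(record(type));
        }
    }

    /** Stand-in implementation that only logs the step names. */
    static final class LoggingBackup implements TwoPhaseBackup {
        final List<String> steps = new ArrayList<>();
        private int nextId = 1;

        public RecordedBackup record(BackupType type) {
            steps.add("roll WALs");
            if (type == BackupType.FULL) {
                steps.add("snapshot tables");
            }
            return new RecordedBackup("backup_" + nextId++, type);
        }

        public void store(RecordedBackup recorded) {
            if (recorded.type == BackupType.FULL) {
                steps.add("snapshot copy");
                steps.add("bulk load copy");
            } else {
                steps.add("convert WALs to HFiles");
                steps.add("bulk load copy");
                steps.add("HFile copy");
            }
            steps.add("update metadata");
            steps.add("terminate backup session");
        }
    }

    public static void main(String[] args) {
        LoggingBackup hbase = new LoggingBackup();
        RecordedBackup recorded = hbase.record(BackupType.FULL);
        // The record phases of other components would run here.
        hbase.store(recorded);
        hbase.steps.forEach(System.out::println);
    }
}
```

A caller that needs the phased behaviour would call record() and store() separately, as in main() above; everything else could keep calling backup().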

Before logging a ticket, I wanted to hear the community's thoughts on this.
Unfortunately, I can't promise we will be available to actually spend time on 
this in the short term, but I'd rather have a plan of attack ready for when we 
(or someone else) do have the time.

Regards,
Dieter
