Hi Dieter,

I don't see a problem with making the individual steps accessible from some
external "driver". My only requirement is a clear interface between the
steps, so that different driver implementations don't end up with divergent
semantics. Today the only driver is the one we ship with the project, so
there's only one place where those semantics must be correct. This is an
area where data loss is possible, and data loss is a reputation-killer for a
data storage system like ours, so we must tread carefully.
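
To make the "clear interface" point concrete, here is a rough sketch of the
kind of contract I have in mind. Every name in it (PhasedBackup,
RecordedBackup, LoggingFullBackup) is hypothetical and does not exist in the
current HBase backup API; the implementation only logs step names. The idea
is that the record phase returns a handle, and the store phase accepts only
that handle and refuses to run twice, so the semantics live in the interface
rather than in each driver.

```java
import java.util.ArrayList;
import java.util.List;

public class PhasedBackupSketch {

    /** Handle returned by the record phase; the only way to start a store phase. */
    static final class RecordedBackup {
        private final String backupId;
        private boolean stored;

        RecordedBackup(String backupId) {
            this.backupId = backupId;
        }

        String backupId() { return backupId; }
        boolean isStored() { return stored; }
        void markStored() { stored = true; }
    }

    /** Hypothetical driver-facing contract; NOT the existing HBase backup API. */
    interface PhasedBackup {
        /** Fast phase: capture what the backup needs (e.g. roll WALs, snapshot tables). */
        RecordedBackup record();

        /** Slow phase: move the captured data, update metadata, end the session. */
        void store(RecordedBackup recorded);
    }

    /** Toy implementation that only logs step names, to show the call order. */
    static final class LoggingFullBackup implements PhasedBackup {
        final List<String> steps = new ArrayList<>();
        private int nextId = 1;

        @Override
        public RecordedBackup record() {
            steps.add("roll WALs");
            steps.add("snapshot tables");
            return new RecordedBackup("backup_" + nextId++);
        }

        @Override
        public void store(RecordedBackup recorded) {
            // Storing the same recording twice is rejected here, not left to each driver.
            if (recorded.isStored()) {
                throw new IllegalStateException(recorded.backupId() + " was already stored");
            }
            steps.add("copy snapshot");
            steps.add("copy bulk loads");
            steps.add("update metadata");
            steps.add("terminate session");
            recorded.markStored();
        }
    }

    public static void main(String[] args) {
        LoggingFullBackup hbase = new LoggingFullBackup();
        RecordedBackup recorded = hbase.record(); // an external driver would record other components here
        hbase.store(recorded);                    // transfers run after all captures are done
        System.out.println(recorded.backupId() + ": " + hbase.steps);
    }
}
```

An external driver would call record() on every component first and only
then call store(), which is the ordering described in the proposal below.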

Thanks,
Nick

On Mon, Jul 15, 2024 at 5:27 PM Dieter De Paepe <diete...@ngdata.com.invalid>
wrote:

> At NGData, we are using HBase backup as part of the backup procedure for
> our product. Besides HBase, some other components (HDFS, ZooKeeper, ...)
> are also backed up.
> Due to how our product works, there are some dependencies between these
> components, i.e. HBase should be backed up first, then ZooKeeper, then...
> To minimize the time between the backup for each component (i.e. to
> minimize data drift), we designed a phased approach in our backup procedure:
>
>   * a "record" phase, where all data relevant for a backup is captured.
>     E.g., for HDFS this is an HDFS snapshot.
>   * a "store" phase, where the captured data is moved to cloud storage.
>     E.g., for HDFS this is a DistCp of that snapshot.
>
> This approach lets us defer all data-transfer delays to the end of the
> backup procedure, which minimizes the time between the data captures of
> the individual components.
>
> The HBase backup API currently doesn't support this kind of phased
> approach, though the steps it executes would certainly allow it:
>
>   * Record phase (full backup): roll WALs, snapshot tables
>   * Store phase (full backup): snapshot copy, bulk load copy, updating
>     metadata, terminating backup session
>   * Record phase (incremental backup): roll WALs
>   * Store phase (incremental backup): convert WALs to HFiles, bulk load
>     copy, HFile copy, metadata updates, terminating backup session
>
> As this seems like a general use-case, I would like to suggest refactoring
> the HBase backup API to allow this kind of 2-phase approach. CLI usage can
> remain unchanged.
>
> Before logging any ticket about this, I wanted to hear the community's
> thoughts about this.
> Unfortunately, I can't promise we will be able to spend time on this in
> the short term, but I'd rather have a plan of attack ready for when we
> (or someone else) do have the time.
>
> Regards,
> Dieter
>
