*Existing Design:*

[image: image.png]

*Problem 1:*

With this design, incremental and full backups can't run in parallel, which
degrades RPOs whenever a full backup runs for a long time, especially for
large tables.

Example:
Expectation: Say you have a big table of 10 TB, your RPO is 60 minutes, and
you are allowed to ship the remote backup at 800 Mbps. You are allowed to
take a full backup once a week, and the rest should be incremental backups.

Shortcoming: With the above design, backups can't run in parallel. Whenever
a full backup is running (which takes roughly 25 hours), no incremental
backup can be taken, and that breaches your RPO.
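
A back-of-the-envelope check of these numbers, as a minimal sketch. It
assumes decimal units (1 TB = 10^12 bytes), a fully utilized link, and no
compression or protocol overhead; the function names are made up for
illustration.

```python
def transfer_hours(size_tb: float, link_mbps: float) -> float:
    """Hours needed to ship size_tb terabytes over a link_mbps link."""
    size_bits = size_tb * 1e12 * 8            # decimal terabytes -> bits
    seconds = size_bits / (link_mbps * 1e6)   # megabits per second -> bits per second
    return seconds / 3600


def blocked_incrementals(full_backup_hours: float, rpo_minutes: float) -> int:
    """Whole RPO windows that elapse while the full backup blocks everything else."""
    return int(full_backup_hours * 60 // rpo_minutes)


hours = transfer_hours(10, 800)
print(f"full backup: ~{hours:.1f} h")
print(f"blocked incremental windows: {blocked_incrementals(hours, 60)}")
```

Under these assumptions the transfer alone takes a little under 28 hours,
the same ballpark as the estimate above, so more than a day's worth of
hourly incremental backups would be skipped.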

*Proposed Solution:* Barring a few critical sections, such as modifying the
state of the backup in the meta tables, everything else can happen in
parallel. Incremental backups should be able to run on top of an older
successful full or incremental backup, and backups should be ordered by
completion time instead of start time. I have not worked out the full
redesign, and will do so if this proposal seems acceptable to the
community.
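
To make the ordering idea concrete, here is a minimal sketch of picking the
base for the next incremental backup by completion time. It is not the
existing backup code: BackupRecord, its fields, and pick_base are
hypothetical names, and timestamps are plain integers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BackupRecord:
    backup_id: str
    kind: str                    # "FULL" or "INCREMENTAL"
    start_ts: int
    complete_ts: Optional[int]   # None while still running (or if it failed)


def pick_base(history: List[BackupRecord]) -> Optional[BackupRecord]:
    """Return the most recently completed backup, ordered by completion time."""
    done = [b for b in history if b.complete_ts is not None]
    return max(done, key=lambda b: b.complete_ts, default=None)


history = [
    BackupRecord("full-1", "FULL", start_ts=0, complete_ts=100),
    BackupRecord("full-2", "FULL", start_ts=150, complete_ts=None),  # still running
    BackupRecord("incr-1", "INCREMENTAL", start_ts=160, complete_ts=165),
]
base = pick_base(history)
print(base.backup_id)
```

In this sketch the running full backup (full-2) is simply ignored until it
completes, so the next incremental backup can build on incr-1 instead of
waiting.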

*Problem 2:*

With only one backup allowed at a time, the design breaks down easily in a
multi-tenant system. This poses the following problems:

   - Admins will not be able to achieve the required RPOs for their tables
   because they depend on the other tenants present in the system. A tenant
   has no control over other tenants' table sizes, and hence over the
   duration of their backups.
   - The management overhead of setting up the right backup sequence to
   achieve the required RPOs for different tenants could be very high.
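
The dependence between tenants can be illustrated with a toy model. This is
only a sketch: the tenant names and durations are invented, and the
parallel case is idealized (no contention for bandwidth or cluster
resources).

```python
from typing import Dict, List, Tuple


def serial_finish_times(queue: List[Tuple[str, float]]) -> Dict[str, float]:
    """Finish time (hours) of each backup when only one may run at a time."""
    clock = 0.0
    finish = {}
    for tenant, duration_h in queue:
        clock += duration_h          # each backup waits for all earlier ones
        finish[tenant] = clock
    return finish


def parallel_finish_times(queue: List[Tuple[str, float]]) -> Dict[str, float]:
    """Finish time (hours) when every backup starts immediately."""
    return {tenant: duration_h for tenant, duration_h in queue}


queue = [("tenant-a", 25.0), ("tenant-b", 0.5)]
print(serial_finish_times(queue))
print(parallel_finish_times(queue))
```

With serial execution, tenant-b's half-hour backup finishes only after
tenant-a's 25-hour backup, regardless of tenant-b's own RPO.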

*Proposed Solution:* Same as the proposal for Problem 1.

*Problem 3:*

Incremental backup works on WALs, and
org.apache.hadoop.hbase.backup.master.BackupLogCleaner ensures that WALs
are never cleaned up until the next backup (full or incremental) is taken.
This poses the following problem:

   - WALs can grow unbounded when there are transient problems, such as the
   backup site facing issues, and keep growing until the next scheduled
   backup succeeds.

*Proposed Solution:* I can't think of anything better, but I see this as a
potential problem. Also, one can force a full backup if the required WAL
files are missing for any other reason, not necessarily one of those
mentioned above.
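
A simplified model of the retention rule and the fallback described here.
It is not the actual BackupLogCleaner logic: the function names, the
timestamp-based rule, and the WAL names are assumptions made for
illustration.

```python
from typing import Dict, List, Set


def deletable_wals(wal_ts: Dict[str, int], last_backup_ts: int) -> List[str]:
    """WALs whose edits are already covered by the last successful backup."""
    return sorted(name for name, ts in wal_ts.items() if ts <= last_backup_ts)


def next_backup_kind(required: Set[str], available: Set[str]) -> str:
    """Fall back to a full backup when any WAL needed for an incremental is gone."""
    return "INCREMENTAL" if required <= available else "FULL"


wals = {"wal-1": 10, "wal-2": 20, "wal-3": 30}

# The last successful backup was at ts=10 and later attempts keep failing,
# so only wal-1 may be cleaned while newer WALs continue to pile up.
print(deletable_wals(wals, last_backup_ts=10))

# wal-2 is needed for the next incremental backup but is missing.
print(next_backup_kind({"wal-2", "wal-3"}, {"wal-3"}))
```

In this model the set of retained WALs shrinks only when last_backup_ts
advances, which is why a long run of failed backups lets them grow without
bound.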

---
Mallikarjun
