[ https://issues.apache.org/jira/browse/HBASE-25784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mallikarjun updated HBASE-25784:
--------------------------------
Attachment: proposed_design.png
> Support for Parallel Backups enabling multi tenancy with rsgroups
> -----------------------------------------------------------------
>
> Key: HBASE-25784
> URL: https://issues.apache.org/jira/browse/HBASE-25784
> Project: HBase
> Issue Type: Umbrella
> Components: backup&restore
> Reporter: Mallikarjun
> Assignee: Mallikarjun
> Priority: Major
> Labels: backup
> Attachments: existing_design.png, proposed_design.png
>
>
> *Existing Design*
>
> *Problem 1:*
> With this design, incremental and full backups can't run in parallel, which
> degrades RPOs whenever a full backup takes a long time, especially for large
> tables.
>
> Example:
> Expectation: Say you have a big 10 TB table, your RPO is 60 minutes, and you
> are allowed to ship the remote backup at 800 Mbps. You are allowed to take a
> full backup once a week, and the rest should be incremental backups.
>
> Shortcoming: With the above design, one can't run backups in parallel, so
> whenever a full backup is running (which takes roughly 25 hours) you are not
> allowed to take incremental backups, and that breaches your RPO.
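
As a rough check of the example above, the transfer time can be worked out
directly. This is only a back-of-the-envelope sketch, assuming decimal units
(1 TB = 10^12 bytes, 1 Mbps = 10^6 bits/s), a fully utilised link, and no
compression; the function name is made up for illustration.

```python
def full_backup_hours(table_bytes: float, link_bits_per_sec: float) -> float:
    """Hours needed to ship table_bytes over a link of the given speed."""
    return table_bytes * 8 / link_bits_per_sec / 3600


TABLE_BYTES = 10 * 10**12   # 10 TB, decimal units (assumption)
LINK_BPS = 800 * 10**6      # 800 Mbps
RPO_MINUTES = 60

hours = full_backup_hours(TABLE_BYTES, LINK_BPS)
# Number of whole RPO windows that elapse while incrementals are blocked.
blocked_windows = int(hours * 60 // RPO_MINUTES)

print(f"full backup: {hours:.1f} h, blocked RPO windows: {blocked_windows}")
```

Under these assumptions the full backup takes about 27.8 hours, the same
range as the rough figure quoted above, so around 27 consecutive 60-minute
RPO windows would pass without an incremental backup.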
>
> *Proposed Solution:* Barring some critical sections, such as modifying the
> state of the backup in the meta tables, the rest of the work can happen in
> parallel. This lets incremental backups run on top of older successful full /
> incremental backups; for ordering, the completion time of a backup should be
> used instead of its start time. I have not worked on the full redesign yet,
> and will do so if this proposal seems acceptable to the community.
>
> *Problem 2:*
> With only one backup at a time, this breaks down easily in a multi-tenant
> system. This poses the following problems:
> * Admins will not be able to achieve the required RPOs for their tables
> because they depend on the other tenants present in the system: one tenant
> has no control over other tenants' table sizes, and hence over the duration
> of their backups.
> * The management overhead of setting up the right sequence to achieve the
> required RPOs for different tenants could be very high.
> *Proposed Solution:* Same as the previous proposal.
>
> *Problem 3:*
> Incremental backup works on WALs, and
> org.apache.hadoop.hbase.backup.master.BackupLogCleaner ensures that WALs are
> never cleaned up until the next backup (Full / Incremental) is taken. This
> poses the following problem:
> * WALs can grow unbounded when there are transient problems, such as the
> backup site facing issues, until the next scheduled backup succeeds.
> *Proposed Solution:* I can't think of anything better, but I see this as a
> potential problem. Also, one can force a full backup if the required WAL
> files are missing for any other reason, not necessarily one mentioned above.
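
To make the WAL retention concern above concrete, here is a toy model. It
does not reproduce the actual BackupLogCleaner logic; the function names,
WAL names, and integer timestamps are all assumptions for illustration. It
shows that WALs newer than the last successful backup are retained, and that
a full backup is the fallback when a required WAL is missing.

```python
from typing import Dict, List, Set


def deletable_wals(wal_ts: Dict[str, int],
                   last_successful_backup_ts: int) -> List[str]:
    """WALs already covered by the last successful backup (toy rule)."""
    return sorted(name for name, ts in wal_ts.items()
                  if ts <= last_successful_backup_ts)


def choose_backup_type(required: Set[str], available: Set[str]) -> str:
    """Fall back to a full backup when any required WAL is missing."""
    return "incremental" if required <= available else "full"


wals = {"wal-1": 10, "wal-2": 20, "wal-3": 30, "wal-4": 40}

# Last backup succeeded at t=20: only WALs up to that point may be cleaned.
print(deletable_wals(wals, last_successful_backup_ts=20))

# While later backups keep failing, the checkpoint stays at t=20 and every
# newer WAL is retained, so the retained set keeps growing.
retained = set(wals) - set(deletable_wals(wals, 20))
print(sorted(retained))

print(choose_backup_type({"wal-3", "wal-4"}, retained))   # all WALs present
print(choose_backup_type({"wal-2", "wal-3"}, retained))   # wal-2 is gone
```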
>
> *Proposed Design*
> !image-2021-06-03-16-34-34-957.png|width=324,height=416!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)