[ 
https://issues.apache.org/jira/browse/HBASE-25784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mallikarjun updated HBASE-25784:
--------------------------------
    Description: 
*Existing Design*

*!image-2021-06-03-16-33-59-282.png|width=292,height=408!*

*Problem 1:* 
 With this design, incremental and full backups can't run in parallel, which 
degrades RPOs when a full backup takes a long time, especially for large 
tables.
  
 Example: 
 Expectation: Say you have a big 10 TB table, your RPO is 60 minutes, and you 
are allowed to ship the remote backup at 800 Mbps. You are allowed to take a 
full backup once a week, and the rest should be incremental backups.
  
 Shortcoming: With the above design, one can't run parallel backups, so 
whenever a full backup is running (which takes roughly 25 hours) you are not 
allowed to take incremental backups, and that breaches your RPO.
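
 As a rough check on the numbers in this example, the sketch below estimates 
the transfer time for the full backup and how many hourly RPO windows pass 
while it runs. It assumes decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 
bits/s), a fully utilized link, and no compression or protocol overhead; the 
function names are illustrative only. Under those assumptions the estimate 
comes out near 28 hours, in the same range as the roughly 25 hours quoted 
above.

```python
# Back-of-the-envelope estimate. Decimal units, a saturated link, and zero
# overhead are assumptions, not measurements.
TABLE_BYTES = 10 * 10**12         # 10 TB table
LINK_BITS_PER_SEC = 800 * 10**6   # 800 Mbps shipping bandwidth
RPO_MINUTES = 60                  # required recovery point objective


def full_backup_hours(table_bytes, link_bits_per_sec):
    """Hours needed to ship the whole table over the link."""
    return table_bytes * 8 / link_bits_per_sec / 3600


def missed_rpo_windows(backup_hours, rpo_minutes):
    """Whole RPO windows that elapse while the full backup is running."""
    return int(backup_hours * 60 // rpo_minutes)


hours = full_backup_hours(TABLE_BYTES, LINK_BITS_PER_SEC)
print(f"{hours:.1f} hours")                    # 27.8 hours
print(missed_rpo_windows(hours, RPO_MINUTES))  # 27
```

 With no incremental backup allowed during that time, every one of those 
hourly windows is a missed recovery point.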
  
 *Proposed Solution:* Barring some critical sections, such as modifying the 
state of the backup in the meta tables, the other steps can run in parallel. 
Incremental backups would then be able to run based on older successful full / 
incremental backups, and the completion time of a backup should be used 
instead of its start time for ordering. I have not worked on the full 
redesign, and will do so if this proposal seems acceptable to the community.
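
 The ordering rule can be illustrated with a small sketch. This is not the 
HBase backup API: the BackupRecord type, its fields, and both helper functions 
are hypothetical names made up for the example. It only shows the idea of 
ordering by completion time, so that an incremental backup can be based on the 
latest completed backup while a full backup is still in flight.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BackupRecord:
    """Hypothetical backup history entry; not an HBase class."""
    backup_id: str
    kind: str                    # "FULL" or "INCREMENTAL"
    start_ts: int                # when the backup started
    complete_ts: Optional[int]   # None while running or if it failed


def completed_in_order(history: List[BackupRecord]) -> List[BackupRecord]:
    """Completed backups only, ordered by completion time (not start time)."""
    done = [b for b in history if b.complete_ts is not None]
    return sorted(done, key=lambda b: b.complete_ts)


def base_for_incremental(history: List[BackupRecord]) -> Optional[BackupRecord]:
    """Most recently completed backup, ignoring any backup still running."""
    ordered = completed_in_order(history)
    return ordered[-1] if ordered else None


history = [
    BackupRecord("full_1", "FULL", start_ts=0, complete_ts=100),
    BackupRecord("full_2", "FULL", start_ts=200, complete_ts=None),  # running
    BackupRecord("incr_1", "INCREMENTAL", start_ts=260, complete_ts=270),
]

base = base_for_incremental(history)
print(base.backup_id)  # incr_1
```

 Here full_2 started before incr_1 but has not finished, so it is skipped when 
picking a base. Once full_2 completes, it sorts after incr_1 because its 
completion time is later.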
  
 *Problem 2:*
 With one backup at a time, the design breaks down easily for a multi-tenant 
system. This poses the following problems:
 * Admins will not be able to achieve the required RPOs for their tables 
because they depend on the other tenants present in the system: one tenant has 
no control over other tenants' table sizes, and hence over the duration of 
their backups.
 * The management overhead of setting up the right sequence to achieve the 
required RPOs for different tenants could be very high.

*Proposed Solution:* Same as the proposal for Problem 1.
  
 *Problem 3:* 
 Incremental backup works on WALs, and 
org.apache.hadoop.hbase.backup.master.BackupLogCleaner ensures that WALs are 
never cleaned up until the next backup (Full / Incremental) is taken. This 
poses the following problem:
 * WALs can grow unbounded when there are transient problems, such as the 
backup site facing issues, until the next scheduled backup succeeds.

 *Proposed Solution:* I can't think of anything better, but I see this as a 
potential problem. Also, one can force a full backup if the required WAL files 
are missing for any other reason, not necessarily one mentioned above. 
  

*Proposed Design.*

!image-2021-06-03-16-34-34-957.png|width=324,height=416!

  was:
*Existing Design*

*!image-2021-06-03-16-33-59-282.png|width=292,height=408!*

*Problem 1:* 
 With this design, Incremental and Full backup can't be run in parallel and 
leading to degraded RPO's in case Full backup is of longer duration esp for 
large tables.
  
 Example: 
 Expectation: Say you have a big table with 10 TB and your RPO is 60 minutes 
and you are allowed to ship the remote backup with 800 Mbps. And you are 
allowed to take Full Backups once in a week and rest of them should be 
incremental backups
  
 Shortcoming: With the above design, one can't run parallel backups and 
whenever there is a full backup running (which takes roughly 25 hours) you are 
not allowed to take incremental backups and that would be a breach in your RPO. 
  
 *Proposed Solution:* Barring some critical sections such as modifying state of 
the backup on meta tables, others can happen parallelly. Leaving incremental 
backups to be able to run based on older successful full / incremental backups 
and completion time of backup should be used instead of start time of backup 
for ordering. I have not worked on the full redesign, and will be doing so if 
this proposal seems acceptable for the community.
  
 *Problem 2:*
 With one backup at a time, it fails easily for a multi-tenant system. This 
poses following problems
 * Admins will not be able to achieve required RPO's for their tables because 
of dependence on other tenants present in the system. As one tenant doesn't 
have control over other tenants' table sizes and hence the duration of the 
backup
 * Management overhead of setting up a right sequence to achieve required RPO's 
for different tenants could be very hard.

*Proposed Solution:* Same as previous proposal
  
 *Problem 3:* 
 Incremental backup works on WAL's and 
org.apache.hadoop.hbase.backup.master.BackupLogCleaner ensures that WAL's are 
never cleaned up until the next backup (Full / Incremental) is taken. This 
poses following problem
 * WAL's can grow unbounded in case there are transient problems like backup 
site facing issues or anything else until next backup scheduled goes successful
 *Proposed Solution:* I can't think of anything better, but I see this can be a 
potential problem. Also, one can force full backup if required WAL files are 
missing for whatever other reasons not necessarily mentioned above. 
  

Proposed Design.

!https://i.ibb.co/vVV1BTs/Backup-Activity-Diagram.png|width=322,height=414!


> Support for Parallel Backups enabling multi tenancy with rsgroups
> -----------------------------------------------------------------
>
>                 Key: HBASE-25784
>                 URL: https://issues.apache.org/jira/browse/HBASE-25784
>             Project: HBase
>          Issue Type: Umbrella
>          Components: backup&restore
>            Reporter: Mallikarjun
>            Assignee: Mallikarjun
>            Priority: Major
>              Labels: backup
>         Attachments: image-2021-06-03-16-33-59-282.png, 
> image-2021-06-03-16-34-34-957.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
