[ 
https://issues.apache.org/jira/browse/HBASE-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235462#comment-16235462
 ] 

Amit Kabra edited comment on HBASE-19105 at 11/2/17 9:45 AM:
-------------------------------------------------------------

> it is not possible to execute two identical backups on different clusters

If we can create time-range buckets and put the data into them, then we can 
compare the data bucket by bucket. E.g., say we got data at time 2 and at 
time 8 and we have buckets 0_5 and 5_10; then we put 2 in the 0_5 bucket and 
8 in the 5_10 bucket. When we compare data, we compare only the corresponding 
buckets in primary and DR.
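
A minimal sketch of the bucketing idea, not tied to any existing HBase backup 
API. The cell layout (row, timestamp, value), the bucket width of 5, the 
half-open bucket boundaries, and all function names are assumptions made only 
for illustration:

```python
import hashlib
from collections import defaultdict

BUCKET_WIDTH = 5  # hypothetical bucket size, in the same unit as cell timestamps


def bucket_of(timestamp, width=BUCKET_WIDTH):
    """Return the half-open bucket [start, end) that contains timestamp."""
    start = (timestamp // width) * width
    return (start, start + width)


def bucket_digests(cells, width=BUCKET_WIDTH):
    """Group (row, timestamp, value) cells by bucket and hash each bucket."""
    grouped = defaultdict(list)
    for row, ts, value in cells:
        grouped[bucket_of(ts, width)].append((row, ts, value))
    digests = {}
    for bucket, items in grouped.items():
        h = hashlib.sha256()
        for item in sorted(items):  # sort so arrival order does not matter
            h.update(repr(item).encode("utf-8"))
        digests[bucket] = h.hexdigest()
    return digests


def differing_buckets(primary_cells, dr_cells, width=BUCKET_WIDTH):
    """Return the buckets whose digests differ between primary and DR."""
    p = bucket_digests(primary_cells, width)
    d = bucket_digests(dr_cells, width)
    return sorted(b for b in p.keys() | d.keys() if p.get(b) != d.get(b))
```

With the example above, bucket_of(2) gives the 0_5 bucket and bucket_of(8) 
gives the 5_10 bucket. Only buckets with the same boundaries are ever compared 
across the two clusters, so the backup_<timestamp> names on each side do not 
need to match.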

> they will always be slightly different due to the replication lag.

That lag will be there, but it will not always be on the order of days; if it 
is, that is an issue in itself anyway. Assuming an acceptable delay of minutes 
or hours, we can always compare data from some time back, and the current data 
will get compared in the next comparison run (the trigger time can be 
configured).
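
One way to picture "compare sometime back data" is to compare only buckets that 
closed at least the acceptable lag ago. This is a rough sketch under assumed 
names and integer time units; max_lag, width and earliest are hypothetical 
parameters, not existing HBase settings:

```python
def settled_buckets(now, max_lag, width, earliest=0):
    """List buckets [start, end) whose end is at least max_lag older than now.

    Buckets that end after the cutoff may still be receiving replicated
    edits on DR, so they are left for a later comparison run.
    """
    cutoff = now - max_lag
    first = (earliest // width) * width
    return [(s, s + width) for s in range(first, cutoff - width + 1, width)]
```

For example, with now=23, max_lag=6 and width=5, the buckets 0_5, 5_10 and 
10_15 are compared, while 15_20 waits for a later run.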

>  I would suggest doing backups only in DR cluster. In this case you will 
> always have single source of backed up data.

For site switching, the primary and DR sites should be in sync all the time so 
that we can do the switch. If we do backups only in DR, then in the worst case, 
what if DR goes down? We cannot just initiate new backups at that point, since 
they would not contain all the past data (deleted data, expired data, versions 
beyond max versions, etc.).

Not saying it's easy or straightforward, though.


> Add ability to compare backups in HBase backups.
> ------------------------------------------------
>
>                 Key: HBASE-19105
>                 URL: https://issues.apache.org/jira/browse/HBASE-19105
>             Project: HBase
>          Issue Type: New Feature
>          Components: backup&restore
>            Reporter: Amit Kabra
>            Priority: Major
>
> For certain scenarios, e.g. a DR scenario, before making a site switch we 
> need to ensure that the backups in primary and DR are the same. A tool that 
> compares backups would help in such cases by doing cross-cluster backup 
> validation.
>  
> Current backups generate data in backup_<timestamp> format, and this can be 
> different in primary and DR, so the backups are not easily comparable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
