[ 
https://issues.apache.org/jira/browse/HBASE-29518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andor Molnar resolved HBASE-29518.
----------------------------------
    Resolution: Fixed

> Support Moving Bulkloaded Files to External Storage in Continuous Backup
> ------------------------------------------------------------------------
>
>                 Key: HBASE-29518
>                 URL: https://issues.apache.org/jira/browse/HBASE-29518
>             Project: HBase
>          Issue Type: Task
>          Components: backup&restore
>            Reporter: Vinayak Hegde
>            Assignee: Vinayak Hegde
>            Priority: Major
>
> Currently, bulkloaded files are copied to external storage (e.g., S3) as part 
> of incremental backup, but not during continuous backup. This leaves a gap in 
> disaster recovery scenarios, as bulkloaded data may remain only on the source 
> cluster. If the cluster storage is lost, those files are unrecoverable even 
> if WALs are available.
> We previously implemented bulkload handling in continuous backup but 
> reverted it due to performance concerns 
> (https://issues.apache.org/jira/browse/HBASE-29406). At that time, we assumed 
> bulkload operations had to be applied in strict order relative to WAL edits, 
> which added complexity and overhead.
> *Why we are reconsidering this now:*
>  * *High bulkload usage:* Many users regularly use bulkload (often at scale, 
> e.g., generating HFiles with Spark and bulkloading them) as their primary 
> data ingestion method.
>  * *Order independence:* Recent discussions confirmed that in HBase, the 
> relative order of WAL replay and bulkload operations does not matter, since 
> all updates (puts and deletes) are resolved by timestamp. This allows us to 
> replay all WAL edits first and bulkload the HFiles afterward, reducing 
> complexity and performance impact.
>  * *Disaster recovery importance:* Storing all backup data, including 
> bulkloaded files, in an external location ensures recovery even if the entire 
> HDFS cluster is inaccessible or destroyed. Keeping backups off-cluster is a 
> best practice to protect against site-level failures.
> *Proposed approach:*
>  * Update the continuous backup replication endpoint to copy bulkloaded files 
> to the backup location.
>  * Optimize performance through batching or asynchronous copying where 
> possible.
>  * Restore workflow: replay WAL entries first, then bulkload HFiles from the 
> backup location.
> *Benefits:*
>  * Ensures all ingested data is protected in the backup location.
>  * Eliminates dependency on the source cluster for recovery.
>  * Aligns continuous backup behavior with incremental backup for consistency.
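
The order-independence argument in the description can be illustrated with a small toy model. This is only a sketch, not HBase code: the names `Cell`, `apply_cells`, and `read_latest` are hypothetical, a cell is reduced to (row, column, timestamp, value), and a value of None stands in for a delete marker. It assumes the newest timestamp wins on read, that a delete beats a put at the same timestamp, and it ignores max-versions, TTL, and ties between two puts with the same timestamp.

```python
from typing import Iterable, Optional, Set, Tuple

# Hypothetical toy cell: (row, column, timestamp, value).
# A value of None models a delete marker.
Cell = Tuple[str, str, int, Optional[bytes]]


def apply_cells(store: Set[Cell], cells: Iterable[Cell]) -> None:
    """Add cells to the store; arrival order is not recorded anywhere."""
    store.update(cells)


def read_latest(store: Set[Cell], row: str, col: str) -> Optional[bytes]:
    """Return the value with the newest timestamp, or None if deleted/absent."""
    versions = [c for c in store if c[0] == row and c[1] == col]
    if not versions:
        return None
    # Newest timestamp wins; at equal timestamps the delete marker wins.
    newest = max(versions, key=lambda c: (c[2], c[3] is None))
    return newest[3]


# Edits that would come from WAL replay (assumed sample data).
wal_edits = [
    ("r1", "cf:a", 10, b"wal-v1"),
    ("r1", "cf:a", 30, None),       # delete marker newer than everything on r1
    ("r2", "cf:a", 20, b"wal-v2"),
]

# Cells that would come from bulkloaded HFiles (assumed sample data).
bulkloaded = [
    ("r1", "cf:a", 20, b"bulk-v1"),
    ("r2", "cf:a", 5, b"bulk-old"),
    ("r3", "cf:a", 15, b"bulk-v3"),
]

# Restore order proposed in the issue: WAL edits first, then bulkloaded files.
wal_first: Set[Cell] = set()
apply_cells(wal_first, wal_edits)
apply_cells(wal_first, bulkloaded)

# Opposite order, for comparison.
bulk_first: Set[Cell] = set()
apply_cells(bulk_first, bulkloaded)
apply_cells(bulk_first, wal_edits)

for row in ("r1", "r2", "r3"):
    assert read_latest(wal_first, row, "cf:a") == read_latest(bulk_first, row, "cf:a")
```

Under these assumptions both restore orders expose the same visible data, because reads depend only on the set of timestamped cells and not on when each cell arrived.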



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
