[
https://issues.apache.org/jira/browse/HBASE-29518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andor Molnar resolved HBASE-29518.
----------------------------------
Resolution: Fixed
> Support Moving Bulkloaded Files to External Storage in Continuous Backup
> ------------------------------------------------------------------------
>
> Key: HBASE-29518
> URL: https://issues.apache.org/jira/browse/HBASE-29518
> Project: HBase
> Issue Type: Task
> Components: backup&restore
> Reporter: Vinayak Hegde
> Assignee: Vinayak Hegde
> Priority: Major
>
> Currently, bulkloaded files are copied to external storage (e.g., S3) as part
> of incremental backup, but not during continuous backup. This leaves a gap in
> disaster recovery scenarios, as bulkloaded data may remain only on the source
> cluster. If the cluster storage is lost, those files are unrecoverable even
> if WALs are available.
> We previously implemented bulkload handling in continuous backup but reverted
> it because of performance concerns
> (https://issues.apache.org/jira/browse/HBASE-29406). At that time, we assumed
> that bulkload operations had to be applied in strict order with WAL edits,
> which added complexity and overhead.
> *Why we are reconsidering this now:*
> * *High bulkload usage:* Many users regularly use bulkload (often at scale,
> e.g., generating HFiles with Spark and bulkloading them) as their primary
> data ingestion method.
> * *Order independence:* Recent discussions confirmed that in HBase, the
> relative order of WAL replay and bulkload operations does not matter, because
> all updates (put/delete) are resolved by timestamp. This lets us replay all
> WAL edits first and bulkload the HFiles afterward, reducing complexity and
> performance impact.
> * *Disaster recovery importance:* Storing all backup data, including
> bulkloaded files, in an external location ensures recovery even if the entire
> HDFS cluster is inaccessible or destroyed. Keeping backups off-cluster is a
> best practice to protect against site-level failures.
> *Proposed approach:*
> * Update the continuous backup replication endpoint to copy bulkloaded files
> to the backup location.
> * Optimize performance through batching or asynchronous copying where
> possible.
> * Restore workflow: replay WAL entries first, then bulkload HFiles from the
> backup location.
> *Benefits:*
> * Ensures all ingested data is protected in the backup location.
> * Eliminates dependency on the source cluster for recovery.
> * Aligns continuous backup behavior with incremental backup for consistency.
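
The order-independence argument in the description can be illustrated with a small, self-contained sketch. This is a simplified model and not the HBase API or the actual patch: the class TimestampOrderDemo, its Edit type, and the restore methods are hypothetical names introduced only for illustration. It assumes a single row/column, that a delete at timestamp T masks every put with timestamp <= T, and that no two edits share a timestamp (tie-breaking is not modeled).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified model of timestamp-based cell resolution; NOT the HBase API. */
class TimestampOrderDemo {

    /** One versioned update to a single row/column (hypothetical type). */
    static final class Edit {
        final long timestamp;
        final boolean isDelete;
        final String value; // null for deletes

        Edit(long timestamp, boolean isDelete, String value) {
            this.timestamp = timestamp;
            this.isDelete = isDelete;
            this.value = value;
        }
    }

    /**
     * Resolves the visible value from a set of edits using timestamps only.
     * Assumption: a delete at timestamp T masks every put with timestamp <= T.
     * Returns null when no put survives.
     */
    static String read(List<Edit> edits) {
        long latestDelete = Long.MIN_VALUE;
        for (Edit e : edits) {
            if (e.isDelete) {
                latestDelete = Math.max(latestDelete, e.timestamp);
            }
        }
        Edit newestPut = null;
        for (Edit e : edits) {
            boolean survives = !e.isDelete && e.timestamp > latestDelete;
            if (survives && (newestPut == null || e.timestamp > newestPut.timestamp)) {
                newestPut = e;
            }
        }
        return newestPut == null ? null : newestPut.value;
    }

    /** Applies WAL edits first and bulkloaded edits afterward, as in the proposed restore. */
    static String restoreWalThenBulkload(List<Edit> wal, List<Edit> bulkloaded) {
        List<Edit> store = new ArrayList<>(wal);
        store.addAll(bulkloaded);
        return read(store);
    }

    /** Applies the same edits in the opposite arrival order. */
    static String restoreBulkloadThenWal(List<Edit> wal, List<Edit> bulkloaded) {
        List<Edit> store = new ArrayList<>(bulkloaded);
        store.addAll(wal);
        return read(store);
    }

    public static void main(String[] args) {
        List<Edit> wal = Arrays.asList(
            new Edit(10L, false, "wal-v1"),
            new Edit(30L, true, null));
        List<Edit> bulkloaded = Arrays.asList(
            new Edit(20L, false, "bulk-v1"),
            new Edit(40L, false, "bulk-v2"));

        // Both arrival orders resolve to the same visible value.
        System.out.println(restoreWalThenBulkload(wal, bulkloaded));
        System.out.println(restoreBulkloadThenWal(wal, bulkloaded));
    }
}
```

In this model both calls print bulk-v2: the delete at timestamp 30 masks the puts at 10 and 20, and the put at 40 survives regardless of which source was applied first. Real HBase resolution has more cases (multiple versions, identical timestamps, several delete types), so this only shows why replaying WAL edits before bulkloading HFiles is plausible, not that it is safe in every case.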