Vinayak Hegde created HBASE-29518:
-------------------------------------

             Summary: Support Moving Bulkloaded Files to External Storage in 
Continuous Backup
                 Key: HBASE-29518
                 URL: https://issues.apache.org/jira/browse/HBASE-29518
             Project: HBase
          Issue Type: Task
          Components: backup&restore
            Reporter: Vinayak Hegde
            Assignee: Vinayak Hegde


Currently, bulkloaded files are copied to external storage (e.g., S3) as part 
of incremental backup, but not during continuous backup. This leaves a gap in 
disaster recovery: bulkloaded data may exist only on the source cluster. If the 
cluster storage is lost, those files are unrecoverable even if the WALs are 
available, because the WAL records only a marker for each bulkload, not the 
HFile contents.

We previously implemented bulkload handling in continuous backup but 
reverted it due to performance concerns 
(https://issues.apache.org/jira/browse/HBASE-29406). At that time, we assumed 
bulkload operations needed to be applied in strict order with WAL edits, which 
added complexity and overhead.

*Why we are reconsidering this now:*
 * *High bulkload usage:* Many users regularly use bulkload (often at scale, 
e.g., generating HFiles with Spark and bulkloading them) as their primary data 
ingestion method.

 * *Order independence:* Recent discussions confirmed that in HBase, the order 
between WAL replay and bulkload operations does not matter, since all updates 
(put/delete) are timestamp-based. This lets us replay all WAL edits first and 
bulkload the HFiles afterward, reducing complexity and performance impact.

 * *Disaster recovery importance:* Storing all backup data, including 
bulkloaded files, in an external location ensures recovery even if the entire 
HDFS cluster is inaccessible or destroyed. Keeping backups off-cluster is a 
best practice to protect against site-level failures.
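
The order-independence point above can be illustrated with a toy model. This is 
a minimal sketch, not HBase code: the `Cell` tuple and the `visible_value` and 
`apply_in_order` helpers are hypothetical. It models only the rule that the cell 
with the newest timestamp wins and that a delete marker masks a put at the same 
or an older timestamp. It ignores multiple versions, delete types, and ties 
between puts with identical timestamps.

```python
from itertools import permutations
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    row: str
    ts: int                # cell timestamp
    value: Optional[str]   # None models a delete marker (assumption of this sketch)


def visible_value(cells, row):
    """Return the value a read would see for `row`, or None if deleted or absent."""
    candidates = [c for c in cells if c.row == row]
    if not candidates:
        return None
    # Newest timestamp wins; at equal timestamps a delete marker masks a put.
    newest = max(candidates, key=lambda c: (c.ts, c.value is None))
    return newest.value


def apply_in_order(batches):
    """Apply batches of cells one after another into a single store."""
    store = []
    for batch in batches:
        store.extend(batch)
    return store


# Hypothetical data: WAL edits and a bulkloaded file touching the same rows.
wal_edits = [Cell("r1", 10, "wal-old"), Cell("r2", 30, None)]
bulkload = [Cell("r1", 20, "bulk-new"), Cell("r2", 25, "bulk")]

# Collect the visible (r1, r2) values for both application orders.
results = set()
for order in permutations([wal_edits, bulkload]):
    store = apply_in_order(order)
    results.add((visible_value(store, "r1"), visible_value(store, "r2")))
```

In this model both orders produce the same visible state, so `results` holds a 
single entry: the bulkloaded value for `r1` and a deleted `r2`.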

*Proposed approach:*
 * Update the continuous backup replication endpoint to copy bulkloaded files 
to the backup location.

 * Optimize performance through batching or asynchronous copying where possible.

 * Restore workflow: replay WAL entries first, then bulkload HFiles from the 
backup location.
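
As a rough illustration of the batching and asynchronous copying idea above, 
the sketch below copies a batch of files concurrently with a thread pool. It is 
only a stand-in under stated assumptions: it uses local filesystem paths and 
Python's standard library rather than HDFS, S3, or the HBase replication 
endpoint, and `copy_bulkloaded_files` and its parameters are hypothetical names.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def copy_bulkloaded_files(hfiles, backup_dir, max_workers=4):
    """Copy one batch of files into backup_dir concurrently.

    Returns the destination paths in the same order as `hfiles`.
    Assumption of this sketch: file names are unique within a batch,
    so the directory layout can be flattened.
    """
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    def copy_one(src):
        dst = backup_dir / Path(src).name
        shutil.copy2(src, dst)
        return dst

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for every copy and re-raises the first failure, so the
        # caller only treats the batch as backed up if all copies succeeded.
        return list(pool.map(copy_one, hfiles))
```

Raising on any failed copy is a deliberate choice in this sketch: a batch is 
reported as backed up only when every file has reached the backup location.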

*Benefits:*
 * Ensures all ingested data is protected in the backup location.

 * Eliminates dependency on the source cluster for recovery.

 * Aligns continuous backup behavior with incremental backup for consistency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
