Vinayak Hegde created HBASE-29518:
-------------------------------------

             Summary: Support Moving Bulkloaded Files to External Storage in Continuous Backup
                 Key: HBASE-29518
                 URL: https://issues.apache.org/jira/browse/HBASE-29518
             Project: HBase
          Issue Type: Task
          Components: backup&restore
            Reporter: Vinayak Hegde
            Assignee: Vinayak Hegde

Currently, bulkloaded files are copied to external storage (e.g., S3) as part of incremental backup, but not during continuous backup. This leaves a gap in disaster recovery scenarios, as bulkloaded data may remain only on the source cluster. If the cluster storage is lost, those files are unrecoverable even if WALs are available.

We had previously implemented bulkload handling in continuous backup but reverted it due to performance concerns (https://issues.apache.org/jira/browse/HBASE-29406). At that time, we assumed bulkload operations needed to be applied in strict order with WAL edits, which added complexity and overhead.

*Why we are reconsidering this now:*
 * *High bulkload usage:* Many users regularly use bulkload (often at scale, e.g., generating HFiles with Spark and bulkloading them) as their primary data ingestion method.
 * *Order independence:* Recent discussions confirmed that in HBase, the order between WAL replay and bulkload operations does not matter, since all updates (put/delete) are timestamp-based. This allows us to first replay all WAL edits, then bulkload HFiles afterward, reducing complexity and performance impact.
 * *Disaster recovery importance:* Storing all backup data, including bulkloaded files, in an external location ensures recovery even if the entire HDFS cluster is inaccessible or destroyed. Keeping backups off-cluster is a best practice to protect against site-level failures.

*Proposed approach:*
 * Update the continuous backup replication endpoint to copy bulkloaded files to the backup location.
 * Optimize performance through batching or asynchronous copying where possible.
 * Restore workflow: replay WAL entries first, then bulkload HFiles from the backup location.

*Benefits:*
 * Ensures all ingested data is protected in the backup location.
 * Eliminates dependency on the source cluster for recovery.
 * Aligns continuous backup behavior with incremental backup for consistency.

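The order-independence argument above can be illustrated with a small toy model. This is only a sketch, not HBase code: the {{Cell}} tuple, {{apply_all}}, and {{read}} are hypothetical names made up for illustration, the store is a plain Python set, and the model assumes that no two puts share the same row, column, and timestamp with different values. Under those assumptions, the visible value depends only on timestamps, so replaying WAL edits first and bulkloading afterward gives the same result as any interleaving.

```python
import itertools
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    """Hypothetical stand-in for a timestamped HBase cell (not the real API)."""
    row: str
    column: str
    timestamp: int
    kind: str                    # "put" or "delete"
    value: Optional[str] = None


def apply_all(cells):
    """Apply cells in the given order; the store only accumulates versions."""
    store = set()
    for cell in cells:
        store.add(cell)
    return store


def read(store, row, column):
    """Return the newest put that is not masked by a delete marker."""
    relevant = [c for c in store if c.row == row and c.column == column]
    # A delete marker masks every put with timestamp <= the marker's timestamp.
    newest_delete = max(
        (c.timestamp for c in relevant if c.kind == "delete"), default=-1
    )
    visible = [
        c for c in relevant if c.kind == "put" and c.timestamp > newest_delete
    ]
    if not visible:
        return None
    return max(visible, key=lambda c: c.timestamp).value


# Edits that would arrive through the WAL (made-up sample data).
wal_edits = [
    Cell("r1", "cf:a", 10, "put", "wal-v1"),
    Cell("r1", "cf:a", 30, "delete"),
    Cell("r2", "cf:a", 20, "put", "wal-v2"),
]
# Cells that would arrive through bulkloaded HFiles (made-up sample data).
bulkloaded = [
    Cell("r1", "cf:a", 20, "put", "bulk-v1"),
    Cell("r1", "cf:a", 40, "put", "bulk-v2"),
    Cell("r2", "cf:a", 15, "put", "bulk-v3"),
]

# Proposed restore order: replay all WAL edits first, then bulkload.
wal_then_bulk = apply_all(wal_edits + bulkloaded)

# Compare against every possible interleaving of the same cells.
results = set()
for order in itertools.permutations(wal_edits + bulkloaded):
    store = apply_all(order)
    results.add((read(store, "r1", "cf:a"), read(store, "r2", "cf:a")))

print(results)
```

In this model every ordering yields the same visible values, which is the property the proposed restore workflow relies on. Real HBase resolves versions with additional machinery that the sketch leaves out.
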
-- This message was sent by Atlassian Jira (v8.20.10#820010)