[ 
https://issues.apache.org/jira/browse/HBASE-29448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HBASE-29448:
-----------------------------------
    Labels: pull-request-available  (was: )

> Modern backup failures can cause backup system to lock up
> ---------------------------------------------------------
>
>                 Key: HBASE-29448
>                 URL: https://issues.apache.org/jira/browse/HBASE-29448
>             Project: HBase
>          Issue Type: Bug
>          Components: backup&restore
>            Reporter: Hernan Gelaf-Romer
>            Assignee: Hernan Gelaf-Romer
>            Priority: Major
>              Labels: pull-request-available
>
> Prior to any backup operation, a snapshot of the {{backup:system}} table is 
> taken. If the backup operation fails, we attempt to restore the 
> {{backup:system}} table from that snapshot, as a way to revert to the last 
> successful state.
>  
> In order to restore, we run different 
> [procedures|https://github.com/apache/hbase/blob/8ddf925daac7af48a5b624c6192bd2cdc45f7955/hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/impl/BackupSystemTable.java#L1400]
>  in sequence. The problem is that these procedures aren't guaranteed to run 
> back to back without interference; there are no atomicity guarantees that 
> prevent other {{backup:system}} operations from being interleaved with them. 
> This can leave the backup system in a stuck state, where it is unable to 
> proceed without manual intervention.
>  
> For example, a backup may fail for any reason, and we then go to restore from 
> the snapshot. However, an EnableTableProcedure might slip in and run 
> between the DisableTableProcedure and the snapshot restore. A concrete 
> example is the BackupHFileCleaner running and enabling the {{backup:system}} 
> table when it creates a 
> [BackupSystemTable|https://github.com/apache/hbase/blob/8ddf925daac7af48a5b624c6192bd2cdc45f7955/hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/BackupHFileCleaner.java#L69]
>  object:
>  
> {code:java}
> 2025-07-06T11:39:23,061 [hfile_cleaner-dir-scan-pool-285] INFO org.apache.hadoop.hbase.client.HBaseAdmin: Started enable of backup:system
> {code}
>  
> Now, subsequent backups cannot run, because they cannot snapshot the table 
> due to an existing snapshot that wasn't correctly cleaned up:
>  
> {code:java}
> 2025-07-06 11:41:48.004 [pool-115-thread-1] ERROR o.a.h.h.b.impl.TableBackupClient - Unexpected Exception : org.apache.hadoop.hbase.snapshot.SnapshotExistsException: Snapshot 'snapshot_backup_system' already stored on the filesystem.
>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.sanityCheckBeforeSnapshot(SnapshotManager.java:804)
>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.access$000(SnapshotManager.java:127)
>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager$1.run(SnapshotManager.java:725)
>     at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil.submitProcedure(MasterProcedureUtil.java:132)
>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.submitSnapshotProcedure(SnapshotManager.java:722)
>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.takeSnapshot(SnapshotManager.java:713)
>     at org.apache.hadoop.hbase.master.MasterRpcServices.snapshot(MasterRpcServices.java:1723)
>     at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:443)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>     at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:105)
>     at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:85)
> {code}
> I can think of two possible solutions:
>  # Create a dedicated procedure that takes a table-level lock on the 
> {{backup:system}} table and restores it from the snapshot. This would ensure 
> the restore process runs without interference.
>  # Implement a different checkpointing system that doesn't require 
> snapshotting the backup system table.
> 
> I'm leaning towards the first option, as it would be easier to implement and 
> wouldn't require a massive rework.
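
The difference between the current behaviour and the first option can be sketched with a toy model. This is not HBase code and does not use the procedure framework: the class `RestoreLockSketch` and all of its methods are hypothetical, a `ReentrantLock` stands in for a table-level lock, and a boolean stands in for the enabled/disabled state of {{backup:system}}. It only illustrates why holding one lock across the whole disable/restore/enable sequence keeps a concurrent enable out.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Toy model only; every name here is hypothetical and unrelated to HBase APIs. */
public class RestoreLockSketch {
    private final ReentrantLock tableLock = new ReentrantLock(); // stand-in for a table-level lock
    private volatile boolean enabled = true;                      // stand-in for the table state

    /** A single step that holds the lock only for its own duration. */
    private void setEnabled(boolean value) {
        tableLock.lock();
        try { enabled = value; } finally { tableLock.unlock(); }
    }

    /** In this model, a restore only succeeds while the table is disabled. */
    private boolean restoreSnapshot() {
        tableLock.lock();
        try { return !enabled; } finally { tableLock.unlock(); }
    }

    /** Stand-in for an unrelated enable; gives up if another thread holds the lock. */
    public boolean tryEnable() {
        if (!tableLock.tryLock()) return false;
        try { enabled = true; return true; } finally { tableLock.unlock(); }
    }

    /** Runs the interloper on a separate thread and waits for it to finish. */
    private static void runOnOtherThread(Runnable r) {
        Thread t = new Thread(r);
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Steps locked one at a time: the interloper can run between disable and restore. */
    public boolean restoreStepwise(Runnable interloper) {
        setEnabled(false);
        runOnOtherThread(interloper);
        boolean ok = restoreSnapshot();
        if (ok) setEnabled(true);
        return ok;
    }

    /** Whole sequence under one lock: the interloper cannot acquire it in between. */
    public boolean restoreLocked(Runnable interloper) {
        tableLock.lock();
        try {
            setEnabled(false);            // the lock is reentrant, so the helpers still work
            runOnOtherThread(interloper);
            boolean ok = restoreSnapshot();
            if (ok) setEnabled(true);
            return ok;
        } finally {
            tableLock.unlock();
        }
    }

    public static void main(String[] args) {
        RestoreLockSketch a = new RestoreLockSketch();
        System.out.println("stepwise restore ok: " + a.restoreStepwise(a::tryEnable));
        RestoreLockSketch b = new RestoreLockSketch();
        System.out.println("locked restore ok: " + b.restoreLocked(b::tryEnable));
    }
}
```

In this model the stepwise restore fails because the enable lands between the disable and the restore, while the locked restore succeeds because the enable cannot take the lock. A real fix would need the lock to be held by a single master procedure rather than by a client thread.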



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
