[
https://issues.apache.org/jira/browse/HBASE-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hernan Romer reassigned HBASE-30218:
------------------------------------
Assignee: (was: Hernan Romer)
> Backup repair permanently holds lock when FULL backup fails in REQUEST phase
> ----------------------------------------------------------------------------
>
> Key: HBASE-30218
> URL: https://issues.apache.org/jira/browse/HBASE-30218
> Project: HBase
> Issue Type: Bug
> Components: backup&restore
> Reporter: Catherine Turner
> Priority: Major
>
> When a FULL backup fails while still in REQUEST phase (before ExportSnapshot
> ever runs), the repair path throws an exception and aborts before releasing
> the backup exclusive lock. The cluster is then permanently wedged: every
> subsequent backup attempt fails with "There is an active session already
> running", which triggers repair, which throws again. The loop can only be
> broken by manual intervention (clearing the lock by hand).
> The root cause is that TableBackupClient.cleanupExportSnapshotLog
> unconditionally attempts to construct a staging-dir Path from the
> snapshot.export.staging.root configuration property. When that property is
> unset, constructing new Path(null) throws an IllegalArgumentException.
> For a backup that never progressed past REQUEST phase, ExportSnapshot was
> never invoked and there are no MapReduce log directories to clean up; the
> call to cleanupExportSnapshotLog should be a no-op. Instead, the unchecked
> exception escapes cleanupAndRestoreBackupSystem, which means the exclusive
> lock is never released.
> ----
> +Steps to reproduce+
> A full backup that stalls or is killed before the export phase leaves the
> session in REQUEST phase with Progress=0%:
> {noformat}
> hbase backup history
> {ID=backup_1780142338094, Type=FULL, Tables={...}, State=RUNNING,
> Start time=..., Phase=REQUEST, Progress=0%}
> {noformat}
> The exclusive lock is still held. Every subsequent backup run fails and the
> end-of-run repair throws:
> {noformat}
> ERROR o.a.h.h.backup.impl.BackupAdminImpl There is an active session already running
> ...
> ERROR ...BackupRepair Failed to run backup repair
> java.lang.IllegalArgumentException: Can not create a Path from a null string
> at org.apache.hadoop.hbase.backup.impl.TableBackupClient.cleanupExportSnapshotLog(TableBackupClient.java:169)
> at org.apache.hadoop.hbase.backup.impl.TableBackupClient.cleanupAndRestoreBackupSystem(TableBackupClient.java:270)
> at ...
> {noformat}
> ----
> +To resolve this issue...+
> * cleanupExportSnapshotLog should guard against a null or absent
> snapshot.export.staging.root. If the property is unset, the method should
> return early (there is nothing to clean up).
> * Additionally, cleanupAndRestoreBackupSystem should ensure the backup
> exclusive lock is released in a finally block so that even an unexpected
> exception in cleanup cannot leave the lock permanently held.
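
The two suggestions above could be sketched roughly as follows. This is a self-contained illustration, not the actual HBase code: the class BackupRepairSketch, the Map standing in for the Hadoop Configuration, the AtomicBoolean standing in for the backup exclusive lock, and the otherCleanup parameter are all hypothetical. Only the property name snapshot.export.staging.root and the two method names come from the report.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the relevant parts of TableBackupClient.
class BackupRepairSketch {
    static final String STAGING_ROOT_KEY = "snapshot.export.staging.root";

    // Assumption: a simple flag models the backup exclusive lock.
    static final AtomicBoolean exclusiveLock = new AtomicBoolean(false);

    // Returns true if cleanup was attempted, false if it was skipped
    // because the staging root is not configured.
    static boolean cleanupExportSnapshotLog(Map<String, String> conf) {
        String stagingRoot = conf.get(STAGING_ROOT_KEY);
        if (stagingRoot == null || stagingRoot.isEmpty()) {
            // Nothing was exported, so there is nothing to clean up.
            return false;
        }
        // The real method would build a Path from stagingRoot here and
        // delete leftover export logs; omitted in this sketch.
        return true;
    }

    // Runs cleanup steps and releases the lock even if one of them throws.
    static void cleanupAndRestoreBackupSystem(Map<String, String> conf,
                                              Runnable otherCleanup) {
        try {
            cleanupExportSnapshotLog(conf);
            otherCleanup.run();
        } finally {
            exclusiveLock.set(false);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>(); // staging root unset
        exclusiveLock.set(true);                    // simulate a held lock
        try {
            cleanupAndRestoreBackupSystem(conf, () -> {
                throw new IllegalStateException("simulated cleanup failure");
            });
        } catch (IllegalStateException e) {
            System.out.println("cleanup failed: " + e.getMessage());
        }
        System.out.println("lock still held: " + exclusiveLock.get());
    }
}
```

The guard alone would address the reported stack trace; the finally block is the broader safety net, so that a future failure elsewhere in cleanup does not wedge the lock again.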