Catherine Turner created HBASE-30218:
----------------------------------------
Summary: Backup repair permanently holds lock when FULL backup fails in REQUEST phase
Key: HBASE-30218
URL: https://issues.apache.org/jira/browse/HBASE-30218
Project: HBase
Issue Type: Bug
Components: backup&restore
Reporter: Catherine Turner
When a FULL backup fails while still in REQUEST phase (before ExportSnapshot
ever runs), the repair path throws an exception and aborts before releasing the
backup exclusive lock. Backups on the cluster are then wedged: every subsequent
backup attempt fails with "There is an active session already running", which
triggers repair, which throws again. The loop can only be broken by manual
intervention (clearing the lock by hand).
The root cause is that TableBackupClient.cleanupExportSnapshotLog
unconditionally attempts to construct a staging-dir Path from the
snapshot.export.staging.root configuration property. When that property is
unset, constructing new Path(null) throws an IllegalArgumentException.
For a backup that never progressed past REQUEST phase, ExportSnapshot was never
invoked and there are no MapReduce log directories to clean up; the call to
cleanupExportSnapshotLog should be a no-op. Instead, the unchecked exception
escapes cleanupAndRestoreBackupSystem, which means the exclusive lock is never
released.
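The failure mode can be modelled outside HBase with a small self-contained Java sketch. This is an illustration only: WedgedRepairSketch, stagingPath and repair are hypothetical names, a boolean flag stands in for the backup exclusive lock, and the null check imitates the precondition that produces the exception in the stack trace rather than calling Hadoop's Path.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Simplified model of the failure; none of these names are HBase classes. */
class WedgedRepairSketch {
    // Stands in for the backup exclusive lock (assumption: modelled as a flag).
    static final AtomicBoolean exclusiveLockHeld = new AtomicBoolean(true);

    // Imitates building a path from the staging-root property:
    // a null value is rejected with an unchecked exception.
    static String stagingPath(String stagingRoot) {
        if (stagingRoot == null) {
            throw new IllegalArgumentException("Can not create a Path from a null string");
        }
        return stagingRoot;
    }

    // Cleanup followed by lock release, with no try/finally around the cleanup.
    static void repair(String stagingRoot) {
        stagingPath(stagingRoot);      // throws when the property is unset
        exclusiveLockHeld.set(false);  // never reached in that case
    }

    public static void main(String[] args) {
        try {
            repair(null); // property unset
        } catch (IllegalArgumentException e) {
            System.out.println("repair failed: " + e.getMessage());
        }
        System.out.println("lock still held: " + exclusiveLockHeld.get());
    }
}
```

Because the release is sequenced after the cleanup instead of being placed in a finally block, any unchecked exception in cleanup skips it, and each later repair attempt fails the same way.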
----
+Steps to reproduce+
A full backup that stalls or is killed before the export phase leaves the
session in REQUEST phase with Progress=0%:
{noformat}
hbase backup history
{ID=backup_1780142338094, Type=FULL, Tables={...}, State=RUNNING, Start time=..., Phase=REQUEST, Progress=0%}
{noformat}
The exclusive lock is still held. Every subsequent backup run fails and the
end-of-run repair throws:
{noformat}
ERROR o.a.h.h.backup.impl.BackupAdminImpl There is an active session already running
...
ERROR ...BackupRepair Failed to run backup repair
java.lang.IllegalArgumentException: Can not create a Path from a null string
    at org.apache.hadoop.hbase.backup.impl.TableBackupClient.cleanupExportSnapshotLog(TableBackupClient.java:169)
    at org.apache.hadoop.hbase.backup.impl.TableBackupClient.cleanupAndRestoreBackupSystem(TableBackupClient.java:270)
    at ...
{noformat}
----
+Proposed fix+
* cleanupExportSnapshotLog should guard against a null or absent
snapshot.export.staging.root. If the property is unset, the method should
return early (there is nothing to clean up).
* Additionally, cleanupAndRestoreBackupSystem should ensure the backup
exclusive lock is released in a finally block so that even an unexpected
exception in cleanup cannot leave the lock permanently held.
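
The two changes above could look roughly like the following self-contained Java sketch. It is not a patch against TableBackupClient: the method names are borrowed from the stack trace, but the signatures, the boolean lock flag, and the otherCleanup parameter are assumptions made so the example compiles and runs on its own.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch of the proposed fix; bodies and signatures are hypothetical stand-ins. */
class GuardedRepairSketch {
    // Stands in for the backup exclusive lock (assumption: modelled as a flag).
    static final AtomicBoolean exclusiveLockHeld = new AtomicBoolean(true);

    // Returns true if cleanup was attempted, false if it was skipped.
    static boolean cleanupExportSnapshotLog(String stagingRoot) {
        if (stagingRoot == null || stagingRoot.isEmpty()) {
            return false; // property unset: nothing to clean up
        }
        // The real method would remove export log directories under stagingRoot here.
        return true;
    }

    // otherCleanup represents any remaining cleanup work that might throw.
    static void cleanupAndRestoreBackupSystem(String stagingRoot, Runnable otherCleanup) {
        try {
            cleanupExportSnapshotLog(stagingRoot);
            otherCleanup.run();
        } finally {
            // Runs whether or not cleanup threw, so the lock cannot stay held.
            exclusiveLockHeld.set(false);
        }
    }

    public static void main(String[] args) {
        try {
            cleanupAndRestoreBackupSystem(null, () -> {
                throw new IllegalStateException("unexpected cleanup failure");
            });
        } catch (IllegalStateException e) {
            System.out.println("cleanup failed: " + e.getMessage());
        }
        System.out.println("lock still held: " + exclusiveLockHeld.get());
    }
}
```

The early return handles the known null case, while the finally block covers cleanup failures that have not been anticipated; either change alone would break the loop described in this issue, but together they also protect against future regressions in the cleanup path.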