Catherine Turner created HBASE-30218:
----------------------------------------

             Summary: Backup repair permanently holds lock when FULL backup 
fails in REQUEST phase
                 Key: HBASE-30218
                 URL: https://issues.apache.org/jira/browse/HBASE-30218
             Project: HBase
          Issue Type: Bug
          Components: backup&restore
            Reporter: Catherine Turner


When a FULL backup fails while still in REQUEST phase (before ExportSnapshot 
ever runs), the repair path throws an exception and aborts before releasing the 
backup exclusive lock. The cluster is then wedged: every subsequent backup 
attempt fails with "There is an active session already running", which triggers 
repair, which throws again. The loop persists until someone clears the lock by 
hand.

The root cause is that TableBackupClient.cleanupExportSnapshotLog 
unconditionally attempts to construct a staging-dir Path from the 
snapshot.export.staging.root configuration property. When that property is 
unset, the lookup yields null, and constructing new Path(null) throws an 
IllegalArgumentException.

For a backup that never progressed past REQUEST phase, ExportSnapshot was never 
invoked and there are no MapReduce log directories to clean up; the call to
cleanupExportSnapshotLog should be a no-op. Instead, the unchecked exception 
escapes cleanupAndRestoreBackupSystem, which means the exclusive lock is never 
released.
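
The failure mode described above can be modelled with a minimal, self-contained 
sketch. This is not the HBase source: the Map stands in for the Hadoop 
Configuration, newPath stands in for the Path(String) constructor, and 
lockHeld and repairLeavesLockHeld are hypothetical names. It also assumes the 
lock release happens after the cleanup call.

```java
import java.util.Map;

public class RepairLockLeakSketch {
    // Hypothetical stand-in for the backup exclusive lock.
    static boolean lockHeld = true;

    // Stand-in for Hadoop's Path(String) constructor, which rejects null.
    static String newPath(String s) {
        if (s == null) {
            throw new IllegalArgumentException(
                "Can not create a Path from a null string");
        }
        return s;
    }

    // Models the unconditional Path construction described in this issue.
    static void cleanupExportSnapshotLog(Map<String, String> conf) {
        String stagingDir = newPath(conf.get("snapshot.export.staging.root"));
        // Real code would go on to delete export logs under stagingDir.
    }

    // Assumption: the lock is released only after cleanup has succeeded.
    static void cleanupAndRestoreBackupSystem(Map<String, String> conf) {
        cleanupExportSnapshotLog(conf);
        lockHeld = false; // skipped when the call above throws
    }

    // Hypothetical repair driver: logs the failure and gives up.
    static boolean repairLeavesLockHeld(Map<String, String> conf) {
        lockHeld = true;
        try {
            cleanupAndRestoreBackupSystem(conf);
        } catch (IllegalArgumentException e) {
            // "Failed to run backup repair"
        }
        return lockHeld;
    }
}
```

In this model, calling repairLeavesLockHeld with an empty map leaves the lock 
held, while a map that sets snapshot.export.staging.root releases it.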
----
+Steps to reproduce+

A full backup that stalls or is killed before the export phase leaves the 
session in REQUEST phase with Progress=0%:
{noformat}
hbase backup history
{ID=backup_1780142338094, Type=FULL, Tables={...}, State=RUNNING,
 Start time=..., Phase=REQUEST, Progress=0%}
{noformat}
The exclusive lock is still held. Every subsequent backup run fails and the 
end-of-run repair throws:
{noformat}
ERROR o.a.h.h.backup.impl.BackupAdminImpl   There is an active session already running
...
ERROR ...BackupRepair        Failed to run backup repair
java.lang.IllegalArgumentException: Can not create a Path from a null string
    at org.apache.hadoop.hbase.backup.impl.TableBackupClient.cleanupExportSnapshotLog(TableBackupClient.java:169)
    at org.apache.hadoop.hbase.backup.impl.TableBackupClient.cleanupAndRestoreBackupSystem(TableBackupClient.java:270)
    at ...
{noformat}
----
+To resolve this issue...+
 * cleanupExportSnapshotLog should guard against a null or absent 
snapshot.export.staging.root. If the property is unset, the method should 
return early (there is nothing to clean up).
 * Additionally, cleanupAndRestoreBackupSystem should ensure the backup 
exclusive lock is released in a finally block so that even an unexpected 
exception in cleanup cannot leave the lock permanently held.
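
A rough sketch of the two proposed changes, under the same simplifying 
assumptions as the rest of this report: the Map stands in for the Hadoop 
Configuration, and lockHeld, otherCleanup and the boolean return value are 
hypothetical. It is not a patch against TableBackupClient.

```java
import java.util.Map;

public class RepairLockFixSketch {
    static final String STAGING_ROOT_KEY = "snapshot.export.staging.root";

    // Hypothetical stand-in for the backup exclusive lock.
    static boolean lockHeld = true;

    // Returns false when there is nothing to clean up (property unset).
    static boolean cleanupExportSnapshotLog(Map<String, String> conf) {
        String stagingRoot = conf.get(STAGING_ROOT_KEY);
        if (stagingRoot == null || stagingRoot.isEmpty()) {
            return false; // early return instead of new Path(null)
        }
        // Real code would build the staging Path and delete export logs here.
        return true;
    }

    // otherCleanup is a hypothetical placeholder for the remaining cleanup
    // steps, which may themselves throw.
    static void cleanupAndRestoreBackupSystem(Map<String, String> conf,
                                              Runnable otherCleanup) {
        try {
            cleanupExportSnapshotLog(conf);
            otherCleanup.run();
        } finally {
            lockHeld = false; // released even if cleanup throws
        }
    }
}
```

The guard addresses the specific exception reported here. The finally block 
is the more general protection, since it covers any other unchecked exception 
raised during cleanup.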



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
