Ray Mattingly created HBASE-28643:
-------------------------------------

             Summary: An unbounded backup failure message can cause an irrecoverable state for the given backup
                 Key: HBASE-28643
                 URL: https://issues.apache.org/jira/browse/HBASE-28643
             Project: HBase
          Issue Type: Bug
    Affects Versions: 2.6.0
            Reporter: Ray Mattingly


The BackupInfo class has a failedMsg field, which is a string of unbounded 
length. When a DistCp job fails, its failure message contains all of its 
source paths, and that message is propagated to the failedMsg field of the 
given BackupInfo.

If a DistCp job has enough source paths, the resulting message pushes the Put 
that persists the BackupInfo past the client's maximum KeyValue size, and 
backup status updates are rejected:
{noformat}
java.lang.IllegalArgumentException: KeyValue size too large
        at org.apache.hadoop.hbase.client.ConnectionUtils.validatePut(ConnectionUtils.java:513)
        at org.apache.hadoop.hbase.client.HTable.validatePut(HTable.java:1095)
        at org.apache.hadoop.hbase.client.HTable.lambda$put$3(HTable.java:564)
        at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:187)
        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:563)
        at org.apache.hadoop.hbase.backup.impl.BackupSystemTable.updateBackupInfo(BackupSystemTable.java:292)
        at org.apache.hadoop.hbase.backup.impl.BackupManager.updateBackupInfo(BackupManager.java:376)
        at org.apache.hadoop.hbase.backup.impl.TableBackupClient.failBackup(TableBackupClient.java:243)
        at org.apache.hadoop.hbase.backup.impl.IncrementalTableBackupClient.execute(IncrementalTableBackupClient.java:317)
        at org.apache.hadoop.hbase.backup.impl.BackupAdminImpl.backupTables(BackupAdminImpl.java:603)
        at com.hubspot.hbase.recovery.core.backup.BackupManager.lambda$runBackups$2(BackupManager.java:145)
{noformat}
Because the backup's state cannot be updated, the client never returns it as 
a failed backup. As a result, any mechanism designed to repair or clean up 
failed backups will not work properly for it.

I think that a simple fix here would be fine: we should truncate the failedMsg 
field to a reasonable maximum size.
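
As a rough sketch of what that truncation could look like: the class name, the 
default cap, and the marker text below are all hypothetical and are not 
existing HBase code. The cap is counted in characters, while the limit in the 
stack trace above is enforced in bytes, so a real fix would need a cap that 
stays well under that limit.

```java
// Hypothetical helper for illustration only; not part of the HBase API.
final class FailedMsgTruncator {
    // Assumed cap, in characters. The value is arbitrary for this sketch.
    static final int MAX_FAILED_MSG_LENGTH = 1000;
    // Assumed marker appended so readers can tell the message was cut.
    static final String TRUNCATION_MARKER = "... [truncated]";

    private FailedMsgTruncator() {
    }

    // Truncates using the assumed default cap.
    static String truncate(String msg) {
        return truncate(msg, MAX_FAILED_MSG_LENGTH);
    }

    // Returns msg unchanged if it is null or already fits within maxLength.
    // Otherwise returns a prefix of msg followed by the marker, with a total
    // length of at most maxLength characters.
    static String truncate(String msg, int maxLength) {
        if (maxLength < TRUNCATION_MARKER.length()) {
            throw new IllegalArgumentException(
                "maxLength must be at least " + TRUNCATION_MARKER.length());
        }
        if (msg == null || msg.length() <= maxLength) {
            return msg;
        }
        int keep = maxLength - TRUNCATION_MARKER.length();
        // Avoid cutting a surrogate pair in half at the truncation point.
        if (keep > 0 && Character.isHighSurrogate(msg.charAt(keep - 1))) {
            keep--;
        }
        return msg.substring(0, keep) + TRUNCATION_MARKER;
    }
}
```

The idea would be to apply something like FailedMsgTruncator.truncate(...) to 
the message before it is stored in failedMsg, so that the status update in 
failBackup can still be persisted.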



