rmdmattingly commented on code in PR #6040:
URL: https://github.com/apache/hbase/pull/6040#discussion_r1768663357


##########
hbase-backup/src/main/java/org/apache/hadoop/hbase/backup/master/BackupLogCleaner.java:
##########
@@ -81,39 +81,55 @@ public void init(Map<String, Object> params) {
     }
   }
 
-  private Map<Address, Long> getServerToNewestBackupTs(List<BackupInfo> backups)
+  /**
+   * Calculates the timestamp boundary up to which all backup roots have already included the WAL.
+   * I.e. WALs with a lower (= older) or equal timestamp are no longer needed for future incremental
+   * backups.
+   */
+  private Map<Address, Long> serverToPreservationBoundaryTs(List<BackupInfo> backups)

Review Comment:
   Hmm, okay, so reframing your example with a single root:
   
   Imagine:
   - 3 servers: S1, S2, S3
   - table T1 has regions on S1 and S2; table T2 has regions on S2 and S3
   - 1 backup root R1 that backs up both T1 and T2
   
   Then, in order:
   - t=0: we back up T1 in R1 => backup B1
   - t=0: we back up T2 in R1 => backup B2
   - t=10: we back up T1 in R1 => backup B3
   - t=20: we back up T2 in R1 => backup B4
   
   Following the logic in this method:
   
   `newestBackupPerRootDir` will contain: (R1: B4)
   
   We would calculate the preservation boundaries here:
   ```java
       // This map tracks, for every address, the least recent (= oldest / lowest timestamp) inclusion
       // in any backup. In other words, it is the timestamp boundary up to which all backup roots
       // have included the WAL in their backup.
       Map<Address, Long> boundaries = new HashMap<>();
       for (BackupInfo backupInfo : newestBackupPerRootDir.values()) {
         for (TableName table : backupInfo.getTables()) {
           for (Map.Entry<String, Long> entry : backupInfo.getTableSetTimestampMap().get(table)
             .entrySet()) {
             Address address = Address.fromString(entry.getKey());
             Long storedTs = boundaries.get(address);
             if (storedTs == null || entry.getValue() < storedTs) {
               boundaries.put(address, entry.getValue());
             }
           }
         }
       }
   ```
   
   Since the newest backup in R1 is B4, `serverToPreservationBoundaryTs` will contain (S1: 20, S2: 20, S3: 20).
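   
   To make this concrete, here is a minimal, self-contained simulation of the boundary computation above, using plain `String` addresses and `long` timestamps instead of the HBase `Address`/`BackupInfo` types (the per-server timestamps recorded by B4 are assumptions consistent with the example):
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   
   public class BoundarySim {
     /**
      * Same reduction as the quoted snippet: keep the lowest timestamp seen per
      * server across the newest backup of each root. Here the only entry in
      * newestBackupPerRootDir is B4.
      */
     static Map<String, Long> computeBoundaries() {
       // Assumed tableSetTimestampMap contents for B4: T2 was backed up at
       // t=20, and the accompanying log roll recorded t=20 on every server.
       Map<String, Map<String, Long>> b4TableTimestamps = new HashMap<>();
       Map<String, Long> t2Timestamps = new HashMap<>();
       t2Timestamps.put("S1", 20L);
       t2Timestamps.put("S2", 20L);
       t2Timestamps.put("S3", 20L);
       b4TableTimestamps.put("T2", t2Timestamps);
   
       Map<String, Long> boundaries = new HashMap<>();
       for (Map<String, Long> tableTs : b4TableTimestamps.values()) {
         for (Map.Entry<String, Long> entry : tableTs.entrySet()) {
           Long storedTs = boundaries.get(entry.getKey());
           if (storedTs == null || entry.getValue() < storedTs) {
             boundaries.put(entry.getKey(), entry.getValue());
           }
         }
       }
       // Note: B3's t=10 coverage of T1 never enters this computation, because
       // B3 is not the newest backup in R1.
       return boundaries;
     }
   
     public static void main(String[] args) {
       System.out.println(computeBoundaries()); // every server maps to 20
     }
   }
   ```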
   
   In this situation, we must not delete WALs from S1 or S2 between times 10 and 20, because a subsequent incremental backup of T1 will require those WALs.
   
   In the BackupLogCleaner, though, we will end up calling `canDeleteFile` and hitting this code with that `serverToPreservationBoundaryTs` (S1: 20, S2: 20, S3: 20) renamed as `addressToBoundaryTs`:
   
   ```java
         if (!addressToBoundaryTs.containsKey(walServerAddress)) { // for S1 & S2, this will be false
            // irrelevant ...
         }
   
         Long backupBoundary = addressToBoundaryTs.get(walServerAddress); // for S1 & S2, this will be 20
         if (backupBoundary >= walTimestamp) { // for WALs between times 10 and 20, this will be true
           if (LOG.isDebugEnabled()) {
             LOG.debug(
               "Backup found for server: {}. All backups from {} are newer than file, so deleting: {}",
               walServerAddress.getHostName(), backupBoundary, path);
           }
           // so we will erroneously delete WALs < 20, despite the next incremental backup of T1 requiring them
           return true;
         }
   ```
   
   I guess this depends on how ancestry is defined? If we consider all backups in a root to be ancestors regardless of their table set, then maybe it is okay to delete these WALs. But if not, then I don't see how it is okay?
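   
   To illustrate that last step, here is a toy version of the `canDeleteFile` boundary check on plain types; the WAL timestamp of 15 is a hypothetical WAL written on S1 after B3 (t=10) but before B4 (t=20):
   
   ```java
   import java.util.Map;
   
   public class DeleteCheckSim {
     /** Mirrors the boundary comparison quoted above, on plain types. */
     static boolean canDelete(Map<String, Long> addressToBoundaryTs, String walServerAddress,
         long walTimestamp) {
       if (!addressToBoundaryTs.containsKey(walServerAddress)) {
         return false; // unknown server: not the case under discussion
       }
       long backupBoundary = addressToBoundaryTs.get(walServerAddress);
       // boundary >= WAL timestamp means the WAL is treated as fully covered
       return backupBoundary >= walTimestamp;
     }
   
     public static void main(String[] args) {
       Map<String, Long> boundaries = Map.of("S1", 20L, "S2", 20L, "S3", 20L);
       // The next incremental backup of T1 still needs this WAL, yet the
       // check approves its deletion.
       System.out.println(canDelete(boundaries, "S1", 15L)); // true
     }
   }
   ```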


