[ 
https://issues.apache.org/jira/browse/HBASE-29905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Van Besien updated HBASE-29905:
-----------------------------------
    Description: 
The backup:system table stores trslm: (table-region-server-log-map) rows with 
the row key format: {{trslm:\0}}

Each row's value is a protobuf-serialized map of {{\{RegionServer → WAL 
timestamp}}}, representing the WAL position up to which each RegionServer has 
been backed up for that table.

BackupLogCleaner uses this information to decide which WAL files to clean up, 
as follows:
 * During backup completion (FullTableBackupClient.java:192 / 
IncrementalTableBackupClient.java:330), writeRegionServerLogTimestamp() writes 
a trslm: row for each table in the backup, recording the latest WAL timestamp 
per RS.
 * Immediately after, readLogTimestampMap() (BackupSystemTable.java:802) scans 
all trslm: rows for that backup root — every table that has ever been backed up 
to that root, not just the tables in the current backup. This full map is 
stored into the BackupInfo object (backupInfo.setTableSetTimestampMap(...)) and 
persisted as part of the session: row in backup:system.
 * BackupLogCleaner (BackupLogCleaner.java:89-142) reads the most recent 
BackupInfo per backup root and iterates over its tableSetTimestampMap. For each 
RegionServer found across all tables, it computes the minimum timestamp as the 
"preservation boundary" for that server. WALs older than or equal to this 
boundary can be deleted; newer ones are retained. A single stale table with a 
year-old timestamp for any RS will pin WAL retention for that RS all the way 
back, preventing WAL cleanup.
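
To illustrate the last step, here is a minimal sketch of the per-server minimum 
computation. It is not the actual BackupLogCleaner code: the class and method 
names are hypothetical, table and server names are plain strings, and treating 
a server without any entry as "not deletable" is an assumption made only for 
this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not taken from the HBase code base.
final class WalBoundarySketch {

  // Input: table name -> (RegionServer -> last backed-up WAL timestamp).
  // Output: RegionServer -> minimum timestamp seen across all tables.
  static Map<String, Long> minTimestampPerServer(
      Map<String, Map<String, Long>> tableSetTimestampMap) {
    Map<String, Long> boundaries = new HashMap<>();
    for (Map<String, Long> perServer : tableSetTimestampMap.values()) {
      for (Map.Entry<String, Long> e : perServer.entrySet()) {
        // Keep the smallest timestamp per server, so one stale table
        // drags the boundary for that server all the way back.
        boundaries.merge(e.getKey(), e.getValue(), Long::min);
      }
    }
    return boundaries;
  }

  // A WAL is deletable if its timestamp is at or below the boundary.
  // Assumption of this sketch: unknown servers are never deletable.
  static boolean canDelete(Map<String, Long> boundaries, String server, long walTimestamp) {
    Long boundary = boundaries.get(server);
    return boundary != null && walTimestamp <= boundary;
  }
}
```

With an active table at timestamp 2000 for rs1 and a stale table at timestamp 
100 for the same rs1, the boundary for rs1 becomes 100, so a WAL with timestamp 
1500 is retained even though the active table has already been backed up past it.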

The root cause is that there is no code anywhere that deletes trslm: rows. They 
are only written (overwritten) when a backup runs for that specific table. Two 
scenarios create stale rows:
 * (a) Table removed from backup (because the table is no longer included in 
backups or simply because the table is deleted).
 * (b) RegionServer decommissioned.

Problem (a) was observed in production (workaround was to remove the stale 
entries manually).

To fix this, I think we need a cleanup mechanism. Perhaps we can filter the 
readLogTimestampMap() results to include only the tables in the current backup 
info and delete everything else (or only filter without deleting, but then the 
stale entries remain in the table).
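
A rough sketch of what such a filter could look like, assuming the timestamp 
map is available as a plain map keyed by table name. The class and method names 
below are hypothetical and do not exist in HBase; the real change would have to 
work with the BackupSystemTable types and issue the actual deletes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not taken from the HBase code base.
final class TimestampMapFilterSketch {

  // Keep only the entries for tables that are part of the current backup.
  static Map<String, Map<String, Long>> retainCurrentTables(
      Map<String, Map<String, Long>> allRows, Set<String> currentTables) {
    Map<String, Map<String, Long>> kept = new HashMap<>();
    for (Map.Entry<String, Map<String, Long>> e : allRows.entrySet()) {
      if (currentTables.contains(e.getKey())) {
        kept.put(e.getKey(), new HashMap<>(e.getValue()));
      }
    }
    return kept;
  }

  // Tables whose rows are candidates for deletion (the "delete" variant).
  static Set<String> staleTables(
      Map<String, Map<String, Long>> allRows, Set<String> currentTables) {
    Set<String> stale = new TreeSet<>(allRows.keySet());
    stale.removeAll(currentTables);
    return stale;
  }
}
```

The filter-only variant would use just retainCurrentTables(); the variant that 
also cleans up would additionally delete the rows for everything returned by 
staleTables().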

  was:
The backup:system table stores trslm: (table-region-server-log-map) rows with 
the row key format: {{trslm:\0}}

Each row's value is a protobuf-serialized map of {{\{RegionServer → WAL 
timestamp}}}, representing the WAL position up to which each RegionServer has 
been backed up for that table.

BackupLogCleaner uses this information to decide which WAL files to clean up, 
as follows:
 * During backup completion (FullTableBackupClient.java:192 / 
IncrementalTableBackupClient.java:330), writeRegionServerLogTimestamp() writes 
a trslm: row for each table in the backup, recording the latest WAL timestamp 
per RS.
 * Immediately after, readLogTimestampMap() (BackupSystemTable.java:802) scans 
all trslm: rows for that backup root — every table that has ever been backed up 
to that root, not just the tables in the current backup. This full map is 
stored into the BackupInfo object (backupInfo.setTableSetTimestampMap(...)) and 
persisted as part of the session: row in backup:system.
 * BackupLogCleaner (BackupLogCleaner.java:89-142) reads the most recent 
BackupInfo per backup root and iterates over its tableSetTimestampMap. For each 
RegionServer found across all tables, it computes the minimum timestamp as the 
"preservation boundary" for that server. WALs older than or equal to this 
boundary can be deleted; newer ones are retained. A single stale table with a 
year-old timestamp for any RS will pin WAL retention for that RS all the way 
back, preventing WAL cleanup.

The root cause is that there is no code anywhere that deletes trslm: rows. They 
are only written (overwritten) when a backup runs for that specific table. Two 
scenarios create stale rows:
 * (a) Table removed from backup (because the table is no longer included in 
backups or simply because the table is deleted).
 * (b) RegionServer decommissioned.

Problem (a) was observed in production.

To fix this, I think we need to have a cleanup mechanism. Perhaps we can filter 
readLogTimestampMap() results to only include tables in the current backup 
info, and delete everything else (or only filter, without delete, but then the 
stale entries still remain in the table).


> BackupLogCleaner retains old WAL files due to stale entries in backup:system 
> table
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-29905
>                 URL: https://issues.apache.org/jira/browse/HBASE-29905
>             Project: HBase
>          Issue Type: Bug
>          Components: backup&restore
>            Reporter: Jan Van Besien
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
