[
https://issues.apache.org/jira/browse/HBASE-29905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jan Van Besien updated HBASE-29905:
-----------------------------------
Description:
The backup:system table stores trslm: (table-region-server-log-map) rows with
the row key format: {{trslm:\0}}
Each row's value is a protobuf-serialized map of {{\{RegionServer → WAL
timestamp}}}, representing the WAL position up to which each RegionServer has
been backed up for that table.
BackupLogCleaner uses this information to decide which WAL files to clean up,
as follows:
* During backup completion (FullTableBackupClient.java:192 /
IncrementalTableBackupClient.java:330), writeRegionServerLogTimestamp() writes
a trslm: row for each table in the backup, recording the latest WAL timestamp
per RS.
* Immediately after, readLogTimestampMap() (BackupSystemTable.java:802) scans
all trslm: rows for that backup root — every table that has ever been backed up
to that root, not just the tables in the current backup. This full map is
stored into the BackupInfo object (backupInfo.setTableSetTimestampMap(...)) and
persisted as part of the session: row in backup:system.
* BackupLogCleaner (BackupLogCleaner.java:89-142) reads the most recent
BackupInfo per backup root and iterates over its tableSetTimestampMap. For each
RegionServer found across all tables, it computes the minimum timestamp as the
"preservation boundary" for that server. WALs older than or equal to this
boundary can be deleted; newer ones are retained. A single stale table with a
year-old timestamp for any RS will pin WAL retention for that RS all the way
back, preventing WAL cleanup.
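The boundary computation described in the last bullet can be sketched roughly as
follows. This is a simplified model using plain Java maps, not the actual HBase
code: the class and method names are hypothetical, table and server names are
plain strings, and the handling of servers without any entry is an assumption
made for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of the per-RegionServer boundary computation. */
final class PreservationBoundarySketch {

  /**
   * Input: table name -> (RegionServer -> last backed-up WAL timestamp).
   * Output: RegionServer -> minimum timestamp seen across all tables.
   */
  static Map<String, Long> boundaries(Map<String, Map<String, Long>> tableSetTimestampMap) {
    Map<String, Long> result = new HashMap<>();
    for (Map<String, Long> perServer : tableSetTimestampMap.values()) {
      for (Map.Entry<String, Long> e : perServer.entrySet()) {
        // Keep the smallest timestamp per server.
        result.merge(e.getKey(), e.getValue(), Long::min);
      }
    }
    return result;
  }

  /**
   * A WAL is deletable when its timestamp is at or below the server's boundary.
   * Assumption: servers with no entry are conservatively retained in this sketch.
   */
  static boolean canDelete(Map<String, Long> boundaries, String server, long walTimestamp) {
    Long boundary = boundaries.get(server);
    return boundary != null && walTimestamp <= boundary;
  }

  public static void main(String[] args) {
    Map<String, Long> active = new HashMap<>();
    active.put("rs1", 1000L);
    active.put("rs2", 1000L);
    Map<String, Long> stale = new HashMap<>();
    stale.put("rs1", 100L); // table no longer backed up, never refreshed

    Map<String, Map<String, Long>> all = new HashMap<>();
    all.put("active_table", active);
    all.put("stale_table", stale);

    Map<String, Long> b = boundaries(all);
    System.out.println(canDelete(b, "rs1", 500L)); // false: pinned by stale_table
    System.out.println(canDelete(b, "rs2", 500L)); // true
  }
}
```

In this toy input, the stale table's entry for rs1 drags the boundary for rs1
down to 100, so a WAL at timestamp 500 on rs1 is retained even though the active
table was backed up to 1000.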
The root cause is that there is no code anywhere that deletes trslm: rows. They
are only written (overwritten) when a backup runs for that specific table. Two
scenarios create stale rows:
* (a) Table removed from backup (because the table is no longer included in
backups or simply because the table is deleted).
* (b) RegionServer decommissioned.
Problem (a) was observed in production (workaround was to remove the stale
entries manually).
To fix this, I think we need a cleanup mechanism. Perhaps we can filter the
readLogTimestampMap() results to include only the tables in the current backup
info, and delete everything else. Filtering without deleting would also work,
but the stale entries would then remain in the table.
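The filtering idea could look roughly like the sketch below. It is only an
illustration under assumptions: the class and method names are hypothetical,
table names are plain strings, and the actual deletion of trslm: rows from
backup:system is not shown (the sketch only reports which tables would be
candidates).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of filtering the timestamp map to the current backup's tables. */
final class StaleEntryFilterSketch {

  /** Keeps only the entries whose table is part of the current backup. */
  static Map<String, Map<String, Long>> retainCurrentTables(
      Map<String, Map<String, Long>> allRows, Set<String> tablesInBackup) {
    Map<String, Map<String, Long>> filtered = new HashMap<>();
    for (Map.Entry<String, Map<String, Long>> e : allRows.entrySet()) {
      if (tablesInBackup.contains(e.getKey())) {
        filtered.put(e.getKey(), e.getValue());
      }
    }
    return filtered;
  }

  /** Tables that have a row but are not in the current backup: deletion candidates. */
  static Set<String> staleTables(
      Map<String, Map<String, Long>> allRows, Set<String> tablesInBackup) {
    Set<String> stale = new HashSet<>(allRows.keySet());
    stale.removeAll(tablesInBackup);
    return stale;
  }

  public static void main(String[] args) {
    Map<String, Map<String, Long>> allRows = new HashMap<>();
    allRows.put("active_table", new HashMap<>());
    allRows.put("stale_table", new HashMap<>());

    Set<String> current = new HashSet<>();
    current.add("active_table");

    System.out.println(retainCurrentTables(allRows, current).keySet()); // [active_table]
    System.out.println(staleTables(allRows, current));                  // [stale_table]
  }
}
```

Note that this only covers scenario (a). Entries for a decommissioned
RegionServer inside a row of a table that is still backed up would pass through
this filter unchanged.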
> BackupLogCleaner retains old WAL files due to stale entries in system:backup
> table
> ----------------------------------------------------------------------------------
>
> Key: HBASE-29905
> URL: https://issues.apache.org/jira/browse/HBASE-29905
> Project: HBase
> Issue Type: Bug
> Components: backup&restore
> Reporter: Jan Van Besien
> Priority: Major