[
https://issues.apache.org/jira/browse/HBASE-29255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945081#comment-17945081
]
Vinayak Hegde edited comment on HBASE-29255 at 4/16/25 2:09 PM:
----------------------------------------------------------------
h2. Objective
Clean up Write-Ahead Logs (WALs) backed up via _continuous backup_ that are no
longer needed, i.e., WALs that cannot be used in any future Point-In-Time
Restore (PITR).
h2. Background
h3. What is PITR?
Point-In-Time Restore (PITR) is the process of restoring table(s) to their
state at a specific timestamp using:
# Full Backup – Snapshot of the table(s) at a certain time.
# Incremental Backup – Captures the delta (changes) since the last backup
(full or incremental).
# Continuous Backup – Continuously stores WALs in the backup location for PITR.
h3. Breakdown of Backup Types
* Full Backup: Snapshot of table(s) at a point in time.
* Incremental Backup: Deltas since the last backup.
* Continuous Backup: WALs of all participating tables are backed up
continuously, and stored in daily partitions.
*Important Notes:*
* WAL files are stored under a day-wise directory structure, for example:
{noformat}
/WALs
  /2025-03-25
    /wal_file.293232
{noformat}
* WALs for all participating tables are stored together in a single file — not
partitioned by table.
* These WALs are not split at finer granularity (e.g., hourly or table-wise).
h2. Solution
During PITR, backed-up WALs are replayed to restore changes made after the most
recent backup. WALs that are older than the earliest available full backup, however,
cannot be used in any PITR and should be deleted.
h3. Why?
Because PITR always starts by restoring a full backup. Then:
# Incremental backups (if any) are applied.
# WALs are replayed from the snapshot time forward.
If a WAL was backed up before the earliest full backup, there is no base state to
apply it to, so it is unusable.
h2. Determining the Cutoff for WAL Deletion
We need a cutoff timestamp before which WALs can be safely deleted. This
timestamp is derived from the first (oldest) full backup.
h3. Timestamps in a Full Backup
A full backup includes the following timestamps:
* fs: Full backup start time
* fm: Snapshot time (logical freeze) — the actual consistent view
* fe: Full backup end time
Ideal choice for cutoff: {{fm}} (snapshot time), because PITR uses it as the base
state.
Reality: we don't have access to {{fm}}, so we conservatively use {{fs}}, even though
it is slightly earlier than the true snapshot point (we end up deleting slightly fewer
WALs than we could, which is safe).
h3. Conclusion
All WALs older than the {{fs}} (start time) of the oldest full backup can be
safely deleted.
h2. Implementation Details
h3. Approach
# Get the oldest full backup using the backup system table.
# Extract its start time (fs) in epoch format.
# Convert that timestamp into a date, and use it to determine which WAL
directories to delete.
# Delete entire day-wise WAL directories that are strictly before the cutoff
date.
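For illustration, here is a minimal sketch of steps 1-3. It assumes the backup system
table can be queried for the backup history and that {{BackupInfo}} exposes the backup
type and start timestamp; the class and method names ({{BackupSystemTable#getBackupHistory}},
{{BackupInfo#getStartTs}}) mirror the hbase-backup module but should be treated as
assumptions, not a final implementation:
{code:java}
// Sketch only: derive the day-level WAL cleanup cutoff from the oldest FULL backup.
// Assumes BackupSystemTable#getBackupHistory() and BackupInfo#getType()/getStartTs();
// adjust to the actual backup API when implementing.
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.List;

import org.apache.hadoop.hbase.backup.BackupInfo;
import org.apache.hadoop.hbase.backup.BackupType;
import org.apache.hadoop.hbase.backup.impl.BackupSystemTable;
import org.apache.hadoop.hbase.client.Connection;

public final class WalCleanupCutoff {

  /**
   * Returns the day-level cutoff date derived from the oldest full backup's start
   * time (fs), or null if no full backup exists (in which case nothing is deleted).
   */
  public static LocalDate computeCutoffDate(Connection conn) throws Exception {
    try (BackupSystemTable systemTable = new BackupSystemTable(conn)) {
      List<BackupInfo> history = systemTable.getBackupHistory();
      long oldestFullStartTs = Long.MAX_VALUE;
      for (BackupInfo info : history) {
        if (info.getType() == BackupType.FULL) {
          oldestFullStartTs = Math.min(oldestFullStartTs, info.getStartTs());
        }
      }
      if (oldestFullStartTs == Long.MAX_VALUE) {
        return null; // no full backup yet, so keep all backed-up WALs
      }
      // fs is an epoch timestamp (assumed to be milliseconds here); round it down to
      // a UTC date, because the backed-up WALs are partitioned into day-wise directories.
      return Instant.ofEpochMilli(oldestFullStartTs).atZone(ZoneOffset.UTC).toLocalDate();
    }
  }
}
{code}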
h3. Why Use Day Boundaries?
Let’s say:
* WAL data exists from Jan 1 to Jan 20
* Cutoff time is Jan 15, 3:00 PM (stored as an epoch timestamp)
Instead of trying to delete only the individual WALs in the Jan 15 directory that fall
before 3 PM:
* We simply delete all day-wise WAL directories from Jan 1 to Jan 14 and leave the
Jan 15 directory intact
This avoids reading each WAL file and checking internal timestamps, which would:
* Require parsing WAL file contents
* Possibly require splitting WAL files that span the cutoff
* Result in re-writing WALs, which adds complexity
Hence, rounding the cutoff down to the start of its day is a reasonable and safe
approximation; a directory-level sketch of this follows below.
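The directory-level deletion (step 4) could then look like the following hedged sketch.
It works purely on the day-wise directory names under the backup WAL root (layout as
shown above), so no WAL file is ever opened, parsed, or split; the class and helper
names are illustrative:
{code:java}
// Sketch only: delete whole day-wise WAL directories strictly before the cutoff date.
// Operates on directory names such as /WALs/2025-03-25, never on WAL file contents.
import java.io.IOException;
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class BackupWalCleaner {

  public static void deleteWalDirsBefore(FileSystem fs, Path walRoot, LocalDate cutoffDate)
      throws IOException {
    for (FileStatus status : fs.listStatus(walRoot)) {
      if (!status.isDirectory()) {
        continue;
      }
      LocalDate dirDate;
      try {
        // Day-wise directories are named with ISO dates, e.g. "2025-03-25".
        dirDate = LocalDate.parse(status.getPath().getName());
      } catch (DateTimeParseException e) {
        continue; // skip anything that is not a day-wise WAL directory
      }
      // Strictly before the cutoff day: with a Jan 15, 3:00 PM cutoff this removes
      // Jan 1 .. Jan 14 and keeps the whole Jan 15 directory, trading a little extra
      // retained data for a much simpler cleanup.
      if (dirDate.isBefore(cutoffDate)) {
        fs.delete(status.getPath(), true); // recursively delete that day's WALs
      }
    }
  }
}
{code}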
h2. Edge Cases & Considerations
h3. Can we go further and delete more?
Possibly — yes.
h4. Example Scenario:
{code:java}
                             t1                                     t2               mapt                          current time
-----------------------------|--------------------------------------|------------------|------------------------------------------->{code}
* t1 = Oldest full backup
* t2 = Incremental backup after t1
* mapt = Current time - PITR window (e.g., 30 days)
Since PITR is only supported within a limited window (nothing earlier than {{mapt}}
can be restored), we might be able to delete WALs even after t1, for example between
t1 and t2, because:
* We can use the incremental backup at t2 instead of the WALs between t1 and t2.
* WALs before {{mapt}} are outside the PITR window anyway.
h3. Why not implement that?
It introduces many edge cases:
* What if there is another full backup between t1 and t2?
* What if {{mapt}} falls between t1 and t2?
* How do we determine whether the WALs are fully covered by incremental
backups?
Handling all these adds significant complexity for very little gain.
So, we stick to a safe, conservative strategy:
Only delete WALs that are older than the earliest full backup.
h2. Integration with Delete Command
This cleanup logic is tied to backups, so WAL cleanup:
* Must happen after backup deletion
* Is best integrated directly into the {{delete}} command
h3. Why?
* If the oldest full backup is deleted, the WALs that were retained only for it become
useless.
* Cleanup must therefore be triggered only after backup deletion has succeeded.
*Plan:*
Extend the {{delete}} command to run this WAL cleanup logic after it has
deleted backups.
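As a rough illustration of the integration point (the surrounding class, fields, and
the existing-delete placeholder below are assumptions rather than the actual
delete-command code path), the cleanup would run only after the backup deletion itself
has succeeded, reusing the sketches above:
{code:java}
// Sketch only: run WAL cleanup as the last step of the backup delete flow.
// doExistingDelete(...) stands in for the delete command's current logic.
import java.time.LocalDate;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Connection;

public final class BackupDeleteWithWalCleanup {

  private final Connection connection;   // connection used by the delete command
  private final FileSystem backupFs;     // filesystem holding the continuous-backup WALs
  private final Path backupWalRoot;      // e.g. the /WALs root under the backup location

  public BackupDeleteWithWalCleanup(Connection connection, FileSystem backupFs,
      Path backupWalRoot) {
    this.connection = connection;
    this.backupFs = backupFs;
    this.backupWalRoot = backupWalRoot;
  }

  public int deleteBackups(String[] backupIds) throws Exception {
    int deleted = doExistingDelete(backupIds); // existing behaviour, unchanged

    // Run WAL cleanup only after the backups were deleted successfully: if the
    // oldest full backup was just removed, the cutoff moves forward and more
    // day-wise WAL directories become eligible for deletion.
    LocalDate cutoff = WalCleanupCutoff.computeCutoffDate(connection);
    if (cutoff != null) {
      BackupWalCleaner.deleteWalDirsBefore(backupFs, backupWalRoot, cutoff);
    }
    return deleted;
  }

  private int doExistingDelete(String[] backupIds) {
    // Placeholder for the current delete-command implementation.
    return backupIds.length;
  }
}
{code}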
h2. Summary
||Step||Action||
|1|Identify oldest full backup from system table|
|2|Extract {{fs}} timestamp (start time)|
|3|Convert timestamp to day-level cutoff|
|4|Delete all WALs strictly before that day|
|5|Integrate logic into {{delete}} command|
This approach is:
* Safe – avoids risk of deleting usable WALs
* Simple – avoids parsing/splitting individual files
* Good enough – even if not the most optimised
> Integrate backup WAL cleanup logic with the delete command
> ----------------------------------------------------------
>
> Key: HBASE-29255
> URL: https://issues.apache.org/jira/browse/HBASE-29255
> Project: HBase
> Issue Type: Task
> Reporter: Vinayak Hegde
> Assignee: Vinayak Hegde
> Priority: Major
>
> The {{delete}} command currently removes both full and incremental backups.
> We plan to extend the command to also clean up WALs that were retained for the
> deleted backups. This will help free up storage and ensure proper cleanup
> post-deletion.