[jira] [Commented] (HBASE-29255) Integrate backup WAL cleanup logic with the delete command

Vinayak Hegde (Jira) Wed, 16 Apr 2025 07:08:05 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-29255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945081#comment-17945081
 ]


Vinayak Hegde commented on HBASE-29255:
---------------------------------------

h2. Objective

Clean up Write-Ahead Logs (WALs) backed up via _continuous backup_ that are no 
longer needed, i.e., WALs that cannot be used in any future Point-In-Time 
Restore (PITR).
h2. Background
h3. What is PITR?

Point-In-Time Restore (PITR) is the process of restoring table(s) to their 
state at a specific timestamp using:
 # Full Backup – Snapshot of the table(s) at a certain time.

 # Incremental Backup – Captures the delta (changes) since the last backup 
(full or incremental).

 # Continuous Backup – Continuously stores WALs in the backup location for PITR.

h3. Breakdown of Backup Types
 * Full Backup: Snapshot of table(s) at a point in time.

 * Incremental Backup: Deltas since the last backup.

 * Continuous Backup: WALs of all participating tables are backed up 
continuously, and stored in daily partitions.

*Important Notes:*
 * WAL files are stored under a daily structure like below:
{{/WALs
  /2025-03-25
    /wal_file.293232}}
 * WALs for all participating tables are stored together in a single file — not 
partitioned by table.

 * These WALs are not split at finer granularity (e.g., hourly or table-wise).
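
As an illustration of the layout above, the day a WAL belongs to can be read straight from its path. This is a minimal sketch, not code from the patch: the class {{WalDayPath}} and the method {{dayOf}} are hypothetical names, and it assumes the directory name is always an ISO date ({{yyyy-MM-dd}}) placed directly under a {{WALs}} path component.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper: names and layout assumptions are illustrative only.
final class WalDayPath {
    private WalDayPath() {}

    /** Returns the day encoded in a path such as /WALs/2025-03-25/wal_file.293232. */
    static Optional<LocalDate> dayOf(String walPath) {
        String[] parts = walPath.split("/");
        for (int i = 0; i + 1 < parts.length; i++) {
            if (parts[i].equals("WALs")) {
                try {
                    // Assumes the component after "WALs" is an ISO date.
                    return Optional.of(LocalDate.parse(parts[i + 1]));
                } catch (DateTimeParseException e) {
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

Because the day is available from the path alone, no WAL file has to be opened to decide which day it belongs to.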

h2. Solution

When doing PITR, WALs help restore changes after the most recent backup. But 
WALs that are older than the earliest available full backup cannot be used for 
PITR and should be deleted.
h3. Why?

Because PITR always starts by restoring a full backup. Then:
 # Incremental backups (if any) are applied.

 # WALs are replayed from the snapshot time forward.

If a WAL was backed up before the earliest full backup, there’s no base state 
to apply it to, so it is unusable.
h2. Determining the Cutoff for WAL Deletion

We need a cutoff timestamp before which WALs can be safely deleted. This 
timestamp is derived from the first (oldest) full backup.
h3. Timestamps in a Full Backup

A full backup includes the following timestamps:
 * fs: Full backup start time

 * fm: Snapshot time (logical freeze) — the actual consistent view

 * fe: Full backup end time

Ideal choice for cutoff: {{fm}} (snapshot time), because PITR uses this as the 
base state.

Reality: We don’t have access to {{fm}}, so we use {{fs}} as a conservative 
substitute. It is slightly earlier than the true snapshot point, so we only 
ever keep a little more WAL data than strictly necessary.
h3. Conclusion

All WALs older than the {{fs}} (start time) of the oldest full backup can be 
safely deleted.
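
The rule above could be sketched as follows. This is only an illustration: {{BackupRecord}}, {{Type}} and {{cutoffTs}} are hypothetical stand-ins for whatever the backup system table actually exposes, and {{startTs}} plays the role of {{fs}}.

```java
import java.util.List;
import java.util.OptionalLong;

// Hypothetical model of backup metadata; not the real backup system table API.
final class WalCutoff {
    private WalCutoff() {}

    enum Type { FULL, INCREMENTAL }

    static final class BackupRecord {
        final String id;
        final Type type;
        final long startTs; // fs: backup start time, epoch milliseconds

        BackupRecord(String id, Type type, long startTs) {
            this.id = id;
            this.type = type;
            this.startTs = startTs;
        }
    }

    /** Returns fs of the oldest full backup, or empty if there is no full backup. */
    static OptionalLong cutoffTs(List<BackupRecord> backups) {
        return backups.stream()
            .filter(b -> b.type == Type.FULL)   // incremental backups never set the cutoff
            .mapToLong(b -> b.startTs)
            .min();
    }
}
```

The empty result leaves the no-full-backup case to the caller, since the comment does not say what should happen to WALs in that situation.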
h2. Implementation Details
h3. Approach
 # Get the oldest full backup using the backup system table.

 # Extract its start time (fs) in epoch format.

 # Convert that timestamp into a date, and use it to determine which WAL 
directories to delete.

 # Delete entire day-wise WAL directories that are strictly before the cutoff 
date.
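
Steps 2 to 4 might look roughly like the sketch below. It assumes day directories are named as ISO dates and that days are cut in UTC; the comment does not state the time zone, so treat that as an assumption. {{WalDirSelector}} and its methods are hypothetical names, and the sketch only selects directories; it does not touch the file system.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: selects day directories that fall strictly before the cutoff day.
final class WalDirSelector {
    private WalDirSelector() {}

    /** Converts fs (epoch milliseconds) to its calendar day. UTC is an assumption. */
    static LocalDate cutoffDay(long fsEpochMillis) {
        return Instant.ofEpochMilli(fsEpochMillis).atZone(ZoneOffset.UTC).toLocalDate();
    }

    /** Returns the directory names (yyyy-MM-dd) strictly before the cutoff day, sorted. */
    static List<String> dirsToDelete(List<String> dayDirNames, long fsEpochMillis) {
        LocalDate cutoff = cutoffDay(fsEpochMillis);
        return dayDirNames.stream()
            .filter(name -> LocalDate.parse(name).isBefore(cutoff)) // the cutoff day itself is kept
            .sorted()
            .collect(Collectors.toList());
    }
}
```

The strict {{isBefore}} comparison is what keeps the directory of the cutoff day, which may still contain WALs written after {{fs}}.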

h3. Why Use Day Boundaries?

Let’s say:
 * WAL data exists from Jan 1 to Jan 20

 * Cutoff time is Jan 15, 3:00 PM (epoch format)

Instead of trying to delete individual WALs from the Jan 15 directory that 
fall before 3 PM:
 * We simply delete all WAL directories from Jan 1 to Jan 14

This avoids reading each WAL file and checking internal timestamps, which would:
 * Require parsing WAL file contents

 * Possibly require splitting WAL files that span the cutoff

 * Result in re-writing WALs, which adds complexity

Hence, rounding the cutoff down to the start of its day is a reasonable and safe approximation.
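
To put a number on the cost of this approximation: the only WAL data kept beyond the exact cutoff is what lies between the start of the cutoff day and the cutoff itself, which is always less than one day. The sketch below computes that slack. {{RoundingSlack}} is a hypothetical name, UTC day boundaries are assumed, and the year 2025 in the usage note is picked only for illustration since the example above gives no year.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

// Hypothetical helper: measures how much extra WAL time day-level rounding retains.
final class RoundingSlack {
    private RoundingSlack() {}

    /** Time between the start of the cutoff's day (UTC, assumed) and the cutoff. */
    static Duration slack(Instant cutoff) {
        Instant dayStart = cutoff.atZone(ZoneOffset.UTC)
            .toLocalDate()
            .atStartOfDay(ZoneOffset.UTC)
            .toInstant();
        return Duration.between(dayStart, cutoff);
    }
}
```

For a cutoff of Jan 15, 3:00 PM, {{slack(Instant.parse("2025-01-15T15:00:00Z"))}} is 15 hours: the part of the Jan 15 directory that an exact, file-level cleanup could have removed.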
h2. Edge Cases & Considerations
h3. Can we go further and delete more?

Possibly — yes.
h4. Example Scenario:
{code:java}
        t1                  t2          mapt         current time
--------|-------------------|-----------|----------------------->{code}
 * t1 = Oldest full backup

 * t2 = Incremental backup after t1

 * mapt = Current time - PITR window (e.g., 30 days)

Since PITR is only possible within a limited window that starts at {{mapt}}, 
we might be able to delete WALs even after t1 (for example, those between t1 
and t2) because:
 * We can use the incremental backup at t2 instead of the WALs between t1 and t2.

 * WALs before {{mapt}} are outside the PITR window anyway.
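
For reference, the window arithmetic used in this scenario can be sketched as below. {{PitrWindow}}, {{mapt}} and {{isRestorable}} are hypothetical names, and treating both ends of the window as inclusive is an assumption.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: models the PITR window described in the scenario.
final class PitrWindow {
    private PitrWindow() {}

    /** mapt = current time - PITR window. */
    static Instant mapt(Instant now, Duration window) {
        return now.minus(window);
    }

    /** True if target lies within [mapt, now]; inclusive bounds are an assumption. */
    static boolean isRestorable(Instant target, Instant now, Duration window) {
        Instant start = mapt(now, window);
        return !target.isBefore(start) && !target.isAfter(now);
    }
}
```

With a 30-day window, any timestamp earlier than {{mapt}} is not a valid restore target, which is why WALs that only cover that period look like deletion candidates.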

h3. Why not implement that?

It introduces many edge cases:
 * What if there is another full backup between t1 and t2?

 * What if {{mapt}} falls between t1 and t2?

 * How do we determine whether the WALs are fully covered by incremental 
backups?

Handling all these adds significant complexity for very little gain.

So, we stick to a safe, conservative strategy:
Only delete WALs that are older than the earliest full backup.
h2. Integration with Delete Command

This cleanup logic is tied to backups, so WAL cleanup:
 * Must happen after backup deletion

 * Is best integrated directly into the {{delete}} command

h3. Why?
 * If the oldest full backup is deleted, the cutoff moves forward, and WALs older than the new oldest full backup become useless.

 * Cleanup must be triggered after backup deletion is successful.

*Plan:*
Extend the {{delete}} command to run this WAL cleanup logic after it has 
deleted backups.
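
The ordering constraint in the plan can be sketched as follows. {{BackupStore}}, {{WalCleaner}} and {{DeleteWithWalCleanup}} are hypothetical interfaces invented for this illustration; they are not the actual classes behind the {{delete}} command.

```java
import java.util.List;

// Hypothetical wiring: shows only the ordering, not the real delete command.
final class DeleteWithWalCleanup {
    private DeleteWithWalCleanup() {}

    interface BackupStore {
        /** Returns true only if every requested backup was deleted. */
        boolean deleteBackups(List<String> backupIds);
    }

    interface WalCleaner {
        /** Recomputes the cutoff from the remaining backups and removes older WAL directories. */
        void cleanUpObsoleteWals();
    }

    /** Deletes backups first; runs WAL cleanup only if the deletion succeeded. */
    static boolean run(BackupStore store, WalCleaner cleaner, List<String> backupIds) {
        if (!store.deleteBackups(backupIds)) {
            return false; // leave WALs untouched when deletion failed
        }
        cleaner.cleanUpObsoleteWals();
        return true;
    }
}
```

Running the cleanup only after a successful deletion means the cutoff is always computed from the backups that actually remain.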
h2. Summary
||Step||Action||
|1|Identify oldest full backup from system table|
|2|Extract {{fs}} timestamp (start time)|
|3|Convert timestamp to day-level cutoff|
|4|Delete all WALs strictly before that day|
|5|Integrate logic into {{delete}} command|

This approach is:
 * Safe – avoids risk of deleting usable WALs

 * Simple – avoids parsing/splitting individual files

 * Good enough – even if not the most optimised

> Integrate backup WAL cleanup logic with the delete command
> ----------------------------------------------------------
>
>                 Key: HBASE-29255
>                 URL: https://issues.apache.org/jira/browse/HBASE-29255
>             Project: HBase
>          Issue Type: Task
>            Reporter: Vinayak Hegde
>            Assignee: Vinayak Hegde
>            Priority: Major
>
> The {{delete}} command currently removes both full and incremental backups. 
> We plan to extend the command to also clean up WALs that were retained due to 
> the deleted backup. This will help free up storage and ensure proper cleanup 
> post-deletion.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
