[jira] [Comment Edited] (HBASE-29261) Investigate flaw in backup deletion validation of PITR-critical backups and propose correct approach

Vinayak Hegde (Jira) Tue, 22 Apr 2025 05:09:06 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-29261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17944425#comment-17944425
 ]


Vinayak Hegde edited comment on HBASE-29261 at 4/22/25 12:08 PM:
-----------------------------------------------------------------

h2. Objective of New Logic

Instead of checking if a backup is the _only_ valid one for all PITR points, we 
aim to:
 * *Identify the exact PITR time range* that a particular full backup can 
support as a base.

 * If *another full backup* also covers that same range, then the current 
backup is *redundant* and can be deleted.

h2. Terminologies
 * {{ct}} (current time): Current timestamp

 * {{mapt}} (maxAllowedPITRTime): Maximum time range to which the user can 
perform PITR. Typically configured at cluster level (e.g., 30 days ago)

 * {{cst}} (continuousBackupStartTime): The earliest time from which continuous 
backup data is available for a table

 * {{{}fs{}}}: Full backup start time (not reliably usable)

 * {{{}fm{}}}: Time when the actual snapshot (logical freeze) was taken — 
slightly later than {{fs}} (we don’t have access to this)

 * {{{}fe{}}}: Full backup end time (we will use this instead of {{fm}} for 
conservative estimation)

h2. Assumptions
 * A full backup captures the table data as of {{{}fm{}}}, which lies somewhere 
after {{{}fs{}}}.

 * After the snapshot is taken ({{{}fm{}}}), data may be added to the table 
even before the backup finishes ({{{}fe{}}}). But since the snapshot freezes 
the state, the backup only contains data up to {{{}fm{}}}.

 * *Since we don’t have access to {{{}fm{}}}, we will conservatively use {{fe}} 
to determine the lower bound of data that the backup might reliably support.*

h2. Case Analysis

*Timeline & Notation Explanation*

Timeline
{code:java}
                   cst                 mapt                                ct
--------------------|--------------------|----------------------------------| 
{code}
Backup Notation:
{code:java}
|---|------|
^   ^      ^
|   |      |
fs  fm    fe{code}
 
----
h4. *Case 1: {{continuousBackupStartTime < maxAllowedPITRTime}}*

This means we have continuous backup coverage even before the PITR time limit. 
In this case, the PITR window spans from {{mapt}} to {{{}ct{}}}.
{code:java}
                   cst                 mapt                                ct
--------------------|--------------------|----------------------------------|
a.  |--|----| 
b.              |--|----| 
c.                 |--|----| 
d.                        |--|----| 
e.                                  |--|----| 
f.                                     |--|----| 
g.                                               |--|----|  {code}
We evaluate different full backups based on their {{{}fe{}}}:
 * If {{{}fe < cst{}}}: The backup does not cover any PITR points. After 
restoring this backup, the table is at a point ({{{}fm{}}}) before {{{}cst{}}}. 
Since continuous backup is only available from {{{}cst{}}}, we cannot reach any 
point in the PITR range.

 * If {{{}cst <= fe < mapt{}}}: Continuous backup exists from {{{}cst{}}}, so 
we can recover from {{mapt}} onwards, but not before. Hence, this backup can 
support {{{}[mapt, ct]{}}}.

 * If {{{}fe >= mapt{}}}: We can recover from {{fe}} onwards, so this backup 
supports {{{}[fe, ct]{}}}.

*Conclusion for Case 1:*
 * Let {{effectiveStart = mapt}}

 * If {{{}fe >= cst{}}}: PITR range supported = {{[max(fe, effectiveStart), 
ct]}}

 * Else: Backup does not cover any PITR points

----
h4. *Case 2: {{maxAllowedPITRTime < continuousBackupStartTime}}*

Here, the user has configured PITR for a time range earlier than we have data 
for. So even though the system allows recovery up to {{{}ct - 30 days{}}}, we 
only have continuous backup data from a later time ({{{}cst{}}}), and that 
limits what is actually recoverable.
{code:java}
                   mapt                 cst                                 ct
--------------------|--------------------|----------------------------------|
a.  |--|----| 
b.              |--|----| 
c.                 |--|----| 
d.                        |--|----| 
e.                                  |--|----| 
f.                                     |--|----| 
g.                                               |--|----|  {code}
 * Since PITR recovery is only possible from {{{}cst{}}}, {{mapt}} becomes 
irrelevant.

 * The effective PITR window is {{[cst, ct]}}

 * If {{{}fe >= cst{}}}: PITR range supported = {{[fe, ct]}}

 * Else: No recovery possible with this backup

*Conclusion for Case 2:*
 * Let {{effectiveStart = cst}}

 * If {{{}fe >= effectiveStart{}}}: PITR range supported = {{[max(fe, 
effectiveStart), ct]}}

 * Else: Backup does not cover any PITR points

----
h4. *Case 3: {{continuousBackupStartTime == maxAllowedPITRTime}}*

This is effectively the same as Case 2 in behavior, with the recovery window 
starting from a single defined time.
{code:java}
                            cst,mapt                              ct
-------------------------------|----------------------------------|
a.                   |--|----|
b.                         |--|----|
c.                            |--|----|
d.                                 |--|----| {code}
 * If {{{}fe >= cst{}}}: PITR range supported = {{[fe, ct]}}

 * Else: No PITR support

*Conclusion for Case 3:*
 * Let {{effectiveStart = cst}} (which is equal to {{{}mapt{}}})

 * If {{{}fe >= effectiveStart{}}}: PITR range supported = {{[max(fe, 
effectiveStart), ct]}}

 * Else: Backup does not cover any PITR points

 
h2. Final General Rule (Applicable in All Cases)
{code:java}
effectiveStart = max(cst, mapt)

If fe >= cst:
    PITR range = [max(fe, effectiveStart), ct]
Else:
    PITR range = None (backup does not help with PITR) {code}


was (Author: JIRAUSER298877):
h2. Objective of New Logic

Instead of checking if a backup is the _only_ valid one for all PITR points, we 
aim to:
 * *Identify the exact PITR time range* that a particular full backup can 
support as a base.

 * If *another full backup* also covers that same range, then the current 
backup is *redundant* and can be deleted.

h2. Terminologies
 * {{ct}} (current time): Current timestamp

 * {{mapt}} (maxAllowedPITRTime): Maximum time range to which the user can 
perform PITR. Typically configured at cluster level (e.g., 30 days ago)

 * {{cst}} (continuousBackupStartTime): The earliest time from which continuous 
backup data is available for a table

 * {{{}fs{}}}: Full backup start time (not reliably usable)

 * {{{}fm{}}}: Time when the actual snapshot (logical freeze) was taken — 
slightly later than {{fs}} (we don’t have access to this)

 * {{{}fe{}}}: Full backup end time (we will use this instead of {{fm}} for 
conservative estimation)

h2. Assumptions
 * A full backup captures the table data as of {{{}fm{}}}, which lies somewhere 
after {{{}fs{}}}.

 * After the snapshot is taken ({{{}fm{}}}), data may be added to the table 
even before the backup finishes ({{{}fe{}}}). But since the snapshot freezes 
the state, the backup only contains data up to {{{}fm{}}}.

 * *Since we don’t have access to {{{}fm{}}}, we will conservatively use {{fe}} 
to determine the lower bound of data that the backup might reliably support.*

h2. Case Analysis

*Timeline & Notation Explanation*

Timeline
{code:java}
                   cst                 mapt                                ct
--------------------|--------------------|----------------------------------| 
{code}
Backup Notation:
{code:java}
|---|------|
^   ^      ^
|   |      |
fs  fm    fe{code}
 
----
h4. *Case 1: {{continuousBackupStartTime < maxAllowedPITRTime}}*

This means we have continuous backup coverage even before the PITR time limit. 
In this case, the PITR window spans from {{mapt}} to {{{}ct{}}}.
{code:java}
                   cst                 mapt                                ct
--------------------|--------------------|----------------------------------|
a.  |--|----| 
b.              |--|----| 
c.                 |--|----| 
d.                        |--|----| 
e.                                  |--|----| 
f.                                     |--|----| 
g.                                               |--|----|  {code}
We evaluate different full backups based on their {{{}fe{}}}:
 * If {{{}fe < cst{}}}: The backup does not cover any PITR points. After 
restoring this backup, the table is at a point ({{{}fm{}}}) before {{{}cst{}}}. 
Since continuous backup is only available from {{{}cst{}}}, we cannot reach any 
point in the PITR range.

 * If {{{}cst <= fe < mapt{}}}: Continuous backup exists from {{{}cst{}}}, so 
we can recover from {{mapt}} onwards, but not before. Hence, this backup can 
support {{{}[mapt, ct]{}}}.

 * If {{{}fe >= mapt{}}}: We can recover from {{fe}} onwards, so this backup 
supports {{{}[fe, ct]{}}}.

*Conclusion for Case 1:*
 * Let {{effectiveStart = mapt}}

 * If {{{}fe >= effectiveStart{}}}: PITR range supported = {{[max(fe, 
effectiveStart), ct]}}

 * Else: Backup does not cover any PITR points

----
h4. *Case 2: {{maxAllowedPITRTime < continuousBackupStartTime}}*

Here, the user has configured PITR for a time range earlier than we have data 
for. So even though the system allows recovery up to {{{}ct - 30 days{}}}, we 
only have continuous backup data from a later time ({{{}cst{}}}), and that 
limits what is actually recoverable.
{code:java}
                   mapt                 cst                                 ct
--------------------|--------------------|----------------------------------|
a.  |--|----| 
b.              |--|----| 
c.                 |--|----| 
d.                        |--|----| 
e.                                  |--|----| 
f.                                     |--|----| 
g.                                               |--|----|  {code}
 * Since PITR recovery is only possible from {{{}cst{}}}, {{mapt}} becomes 
irrelevant.

 * The effective PITR window is {{[cst, ct]}}

 * If {{{}fe >= cst{}}}: PITR range supported = {{[fe, ct]}}

 * Else: No recovery possible with this backup

*Conclusion for Case 2:*
 * Let {{effectiveStart = cst}}

 * If {{{}fe >= effectiveStart{}}}: PITR range supported = {{[max(fe, 
effectiveStart), ct]}}

 * Else: Backup does not cover any PITR points

----
h4. *Case 3: {{continuousBackupStartTime == maxAllowedPITRTime}}*

This is effectively the same as Case 2 in behavior, with the recovery window 
starting from a single defined time.
{code:java}
                            cst,mapt                              ct
-------------------------------|----------------------------------|
a.                   |--|----|
b.                         |--|----|
c.                            |--|----|
d.                                 |--|----| {code}
 * If {{{}fe >= cst{}}}: PITR range supported = {{[fe, ct]}}

 * Else: No PITR support

*Conclusion for Case 3:*
 * Let {{effectiveStart = cst}} (which is equal to {{{}mapt{}}})

 * If {{{}fe >= effectiveStart{}}}: PITR range supported = {{[max(fe, 
effectiveStart), ct]}}

 * Else: Backup does not cover any PITR points

 
h2. Final General Rule (Applicable in All Cases)
{code:java}
effectiveStart = max(cst, mapt)

If fe >= effectiveStart:
    PITR range = [max(fe, effectiveStart), ct]
Else:
    PITR range = None (backup does not help with PITR) {code}

> Investigate flaw in backup deletion validation of PITR-critical backups and 
> propose correct approach
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29261
>                 URL: https://issues.apache.org/jira/browse/HBASE-29261
>             Project: HBase
>          Issue Type: Task
>          Components: backup&amp;restore
>            Reporter: Vinayak Hegde
>            Assignee: Vinayak Hegde
>            Priority: Major
>
> This Jira investigates a flaw in our current logic used to validate whether a 
> full backup—potentially critical for PITR (Point-In-Time Recovery)—can be 
> safely deleted.
> The current approach incorrectly checks whether a full backup is the only 
> valid base for *all* PITR target points, which is not a valid criterion. A 
> full backup should not be required to support _all_ PITR points to be 
> considered necessary. Instead, each full backup only contributes to a 
> {*}specific PITR time range{*}, depending on when the backup was taken and 
> the availability of continuous backups afterward.
> This ticket proposes a more accurate and conservative approach:
>  * Determine the PITR range each full backup can support.
>  * Identify if another full backup exists that fully covers the same range.
>  * If such a backup exists, the original one can be considered safe for 
> deletion.
> All edge cases and reasoning are explained in the comments for clarity.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (HBASE-29261) Investigate flaw in backup deletion validation of PITR-critical backups and propose correct approach

Reply via email to