[jira] [Comment Edited] (HBASE-29133) Implement "pitr" Command for Point-in-Time Recovery/Restore

Vinayak Hegde (Jira) Thu, 17 Apr 2025 04:47:11 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-29133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945345#comment-17945345
 ]


Vinayak Hegde edited comment on HBASE-29133 at 4/17/25 11:46 AM:
-----------------------------------------------------------------

h2. Design
h3. Problem Statement

To simplify the problem:

We need to support {*}Point-in-Time Recovery (PITR){*}, where the user provides 
a set of tables and a target timestamp. The system should then restore these 
tables to the specified point in time.

This recovery will be performed using a combination of {*}full, incremental{*}, 
and *continuous backups* available in the backup locations.

The recovery process involves:
 # Restoring the most recent valid full backup.
 # Applying all relevant incremental backups.
 # Replaying Write-Ahead Logs (WALs) up to the requested point in time.

----
h3. Terminology
 * {*}replicationCheckpoint (rc){*}: The latest timestamp before which all data 
is guaranteed to be successfully backed up via continuous replication.

 * {*}maxAllowedPITRTime (mapt){*}: The maximum lookback duration (e.g., 30 
days) allowed for PITR. This is a cluster-level configuration. Thus, PITR is 
allowed from {{currentTime - mapt}} to {{{}currentTime{}}}.

 * {*}requestedToTime (rt){*}: The timestamp to which the user wants to restore 
the data.

 * {*}currentTime (ct){*}: The time at which the PITR request is being 
processed.

 * {*}fs{*}: Backup start time (when the backup begins).

 * {*}fm{*}: Logical snapshot time (when the snapshot is taken; represents the 
state of data).

 * {*}fe{*}: Backup end time.

h3. Limitation

We do not have a reliable way to determine the exact {{fm}} from current 
metadata. Hence, we conservatively use {{fe}} when checking that a backup 
completed before {{rt}}, and {{fs}} as the starting point for WAL coverage and 
replay.
----
h2. Validations

Several validations must be performed before proceeding:
h4. 1. Continuous Backup Enabled

If any of the requested tables does not have continuous backup enabled, the 
PITR request should be rejected.
h4. 2. Requested Time Within PITR Window

The requested point-in-time ({{{}rt{}}}) must lie within a valid PITR window, 
defined as:
 * {{rt >= currentTime - maxAllowedPITRTime}}

 * {{rt <= replicationCheckpoint}}

{quote}{*}Note{*}: While the PITR window theoretically ends at 
{{{}currentTime{}}}, replication may lag behind. For example, with a 1-hour 
lag, valid recovery is only possible up to {{{}currentTime - 1 hour{}}}.
{quote}
Refer to HBASE-29220 for more details about replication checkpoint.

 

So the valid range is:
{{ct - mapt ≤ rt ≤ rc}}
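
The two window conditions above can be sketched as a single check. This is a minimal illustration, not the actual implementation: the class {{PitrWindow}} and its method are hypothetical names, and all timestamps are assumed to be epoch milliseconds.

```java
/** Illustrative only: a hypothetical helper, not part of the HBase API. */
final class PitrWindow {
  private PitrWindow() {}

  /**
   * Returns true when rt lies inside the PITR window, i.e.
   * (ct - mapt) <= rt <= rc.
   *
   * @param rt         requestedToTime, epoch millis (assumed unit)
   * @param ct         currentTime, epoch millis
   * @param maptMillis maxAllowedPITRTime expressed as a duration in millis
   * @param rc         replicationCheckpoint, epoch millis
   */
  static boolean isWithinWindow(long rt, long ct, long maptMillis, long rc) {
    long earliestAllowed = ct - maptMillis; // lower bound set by the lookback limit
    return rt >= earliestAllowed && rt <= rc; // upper bound set by replication progress
  }
}
```

With a 30-day lookback and a 1-hour replication lag, a request for 30 minutes ago fails the upper bound even though it is well inside the lookback limit.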
h4. 3. Backup and WAL Coverage

We must ensure:
 * A valid base backup (full/incremental) exists before {{{}rt{}}}.

 * WAL files are available to bridge the gap between that backup and {{{}rt{}}}.

Steps for validation (per table):
 # Traverse backup records in reverse chronological order.
 # For each backup:
 ** Does it include the target table?
 ** Was it completed before {{rt}}? (i.e., {{fe <= rt}})
 ** Do WALs cover from {{fs}} to {{rt}}?
 ** Are required ancestor backups available?
 # If no such backup is found, throw an error and exit.

If all checks pass:
 * The requested {{rt}} is valid.

 * A valid base backup and WALs are available for each table.

If the user requested {_}only validation{_}, return success at this point.
h3. Determining a Valid Backup

To consider a backup valid for PITR:
 # *Includes the table* being restored.
 # {*}Completed before {{requestedToTime}}{*}, i.e., {{{}fe <= rt{}}}.
 ** Since {{fm}} is not recorded, we conservatively use {{fe}} for validation.
 # {*}WALs are available from {{fs}} to {{rt}}{*}.
 ** If WALs exist from {{fs}} onward, they cover the unknown {{fm}} as well.
 # *All required ancestor backups* are available.

 ** Use existing mechanisms from the standard restore workflow to verify this.
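
The per-table selection described above can be sketched as a loop over backup records, newest first. This is an assumption-laden illustration: {{BackupRecord}}, {{WalCoverage}} and {{findValidBackup}} are hypothetical names rather than existing HBase classes, the ancestor check is reduced to a precomputed flag, and timestamps are assumed to be epoch milliseconds.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Illustrative only: these types are hypothetical, not HBase backup classes. */
final class PitrBackupSelector {
  private PitrBackupSelector() {}

  /** Hypothetical, simplified view of one backup record. */
  static final class BackupRecord {
    final String id;
    final Set<String> tables;
    final long fs; // backup start time
    final long fe; // backup end time
    final boolean ancestorsAvailable; // stand-in for the real ancestor check

    BackupRecord(String id, Set<String> tables, long fs, long fe,
        boolean ancestorsAvailable) {
      this.id = id;
      this.tables = tables;
      this.fs = fs;
      this.fe = fe;
      this.ancestorsAvailable = ancestorsAvailable;
    }
  }

  /** Hypothetical lookup: are WALs for the table available over [fromTs, toTs]? */
  interface WalCoverage {
    boolean covers(String table, long fromTs, long toTs);
  }

  /**
   * Returns the first backup (in the given newest-first order) that satisfies
   * all four criteria, or empty if none does.
   */
  static Optional<BackupRecord> findValidBackup(String table, long rt,
      List<BackupRecord> newestFirst, WalCoverage wals) {
    for (BackupRecord b : newestFirst) {
      if (!b.tables.contains(table)) continue;       // 1. includes the table
      if (b.fe > rt) continue;                       // 2. completed before rt
      if (!wals.covers(table, b.fs, rt)) continue;   // 3. WALs from fs to rt
      if (!b.ancestorsAvailable) continue;           // 4. ancestors available
      return Optional.of(b);
    }
    return Optional.empty();
  }
}
```

An empty result corresponds to the "throw an error and exit" step; a caller would run this once per requested table.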

----
h3. Execution

Once a valid base backup is identified and validated:
 # *Restore the backup* using the existing restore mechanism.
 # *Replay WALs* using the {{WALPlayer}} MapReduce job to bring the table to 
the requested point in time.

{quote}Since {{fm}} is unknown, we conservatively replay WALs from {{fs}} to 
{{{}requestedToTime{}}}.
{quote}
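
As a rough sketch of the replay step, the conservative window from {{fs}} to {{rt}} could be turned into time-range arguments for a {{WALPlayer}} run. The {{wal.start.time}} and {{wal.end.time}} properties are the time filters documented for the existing {{WALPlayer}} tool; everything else here ({{WalReplayPlan}}, the argument layout, the example path) is a hypothetical illustration, and the final pitr implementation may pass these values differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hypothetical helper, not the actual pitr implementation. */
final class WalReplayPlan {
  private WalReplayPlan() {}

  /**
   * Builds a WALPlayer-style argument list that replays WAL entries from the
   * backup start time (fs) up to the requested time (rt), both epoch millis.
   * Starting at fs rather than the unknown fm is the conservative choice.
   */
  static List<String> walPlayerArgs(long fs, long rt, String walDir, String table) {
    if (fs > rt) {
      throw new IllegalArgumentException("fs must not be after rt");
    }
    List<String> result = new ArrayList<>();
    result.add("-Dwal.start.time=" + fs); // lower time bound for replayed entries
    result.add("-Dwal.end.time=" + rt);   // upper time bound for replayed entries
    result.add(walDir);                   // WAL input directory (assumed layout)
    result.add(table);                    // table to replay
    return result;
  }
}
```

For example, {{walPlayerArgs(100L, 250L, "/backup/wals", "t1")}} yields the start and end time properties followed by the directory and table name.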


was (Author: JIRAUSER298877):
h2. Design
h3. Problem Statement

To simplify the problem:

We need to support {*}Point-in-Time Recovery (PITR){*}, where the user provides 
a set of tables and a target timestamp. The system should then restore these 
tables to the specified point in time.

This recovery will be performed using a combination of {*}full, incremental{*}, 
and *continuous backups* available in the backup locations.

The recovery process involves:
 # Restoring the most recent valid full backup.

 # Applying all relevant incremental backups.

 # Replaying Write-Ahead Logs (WALs) up to the requested point in time.

----
h3. Terminology
 * {*}replicationCheckpoint (rc){*}: The latest timestamp before which all data 
is guaranteed to be successfully backed up via continuous replication.

 * {*}maxAllowedPITRTime (mapt){*}: The maximum lookback duration (e.g., 30 
days) allowed for PITR. This is a cluster-level configuration. Thus, PITR is 
allowed from {{currentTime - mapt}} to {{{}currentTime{}}}.

 * {*}requestedToTime (rt){*}: The timestamp to which the user wants to restore 
the data.

 * {*}currentTime (ct){*}: The time at which the PITR request is being 
processed.

 * *fs:* backup start time (when backup begins).

 * {*}fm{*}: Logical snapshot time (when the snapshot is taken; represents the 
state of data).

 * {*}fe{*}: backup end time.

h3. Limitation

We do not have a reliable way to determine the exact {{fm}} from current 
metadata. Hence, we conservatively use {{fs}} or {{fe}} as a proxy.
----
h2. Validations

Several validations must be performed before proceeding:
h4. 1. Continuous Backup Enabled

If continuous backup is not enabled for any of the requested tables, the PITR 
request should be rejected.
h4. 2. Requested Time Within PITR Window

The requested point-in-time ({{{}rt{}}}) must lie within a valid PITR window, 
defined as:
 * {{rt >= currentTime - maxAllowedPITRTime}}

 * {{rt <= replicationCheckpoint}}

{quote}{*}Note{*}: While the PITR window theoretically ends at 
{{{}currentTime{}}}, replication may lag behind. For example, with a 1-hour 
lag, valid recovery is only possible up to {{{}currentTime - 1 hour{}}}.
{quote}
Refer to HBASE-29220 for more details about replication checkpoint.

{{}}

So the valid range is:
{{mapt ≤ rt ≤ rc}}
h4. 3. Backup and WAL Coverage

We must ensure:
 * A valid base backup (full/incremental) exists before {{{}rt{}}}.

 * WAL files are available to bridge the gap between that backup and {{{}rt{}}}.

Steps for validation (per table):
 # Traverse backup records in reverse chronological order.

 # For each backup:

 ** Does it include the target table?

 ** Was it completed before {{{}rt{}}}? (i.e., {{{}fe <= rt{}}})

 ** Do WALs cover from {{fs}} to {{{}rt{}}}?

 ** Are required ancestor backups available?

 # If no such backup is found, throw an error and exit.

If all checks pass:
 * The requested {{rt}} is valid.

 * A valid base backup and WALs are available for each table.

If the user requested {_}only validation{_}, return success at this point.
h3. Determining a Valid Backup

To consider a backup valid for PITR:
 # *Includes the table* being restored.

 # {*}Completed before {{requestedToTime}}{*}, i.e., {{{}fe <= rt{}}}.

 ** Since {{fm}} is not recorded, we conservatively use {{fe}} for validation.

 # {*}WALs are available from {{fs}} to {{rt}}{*}.

 ** If WALs exist from {{fs}} onward, they cover the unknown {{fm}} as well.

 # *All required ancestor backups* are available.

 ** Use existing mechanisms from the standard restore workflow to verify this.

----
h3. Execution

Once a valid base backup is identified and validated:
 # *Restore the backup* using the existing restore mechanism.

 # *Replay WALs* using the {{WALPlayer}} MapReduce job to bring the table to 
the requested point in time.

{quote}Since {{fm}} is unknown, we conservatively replay WALs from {{fs}} to 
{{{}requestedToTime{}}}.
{quote}

> Implement "pitr" Command for Point-in-Time Recovery/Restore
> -----------------------------------------------------------
>
>                 Key: HBASE-29133
>                 URL: https://issues.apache.org/jira/browse/HBASE-29133
>             Project: HBase
>          Issue Type: Task
>          Components: backup&restore
>    Affects Versions: 2.6.0, 3.0.0-alpha-4
>            Reporter: Vinayak Hegde
>            Assignee: Vinayak Hegde
>            Priority: Major
>              Labels: pull-request-available
>
> h4. New "pitr" Command
> {code:java}
> hbase pitr 
>     [-t <table_name[,table_name]>] 
>     [-s <backup_set_name>] 
>     [-q <name>] 
>     [-c] 
>     [-m <target_tables>] 
>     [-o] 
>     [--to-datetime <end_time>] {code}
> h4. Process for Each Table:
>  # Identify the most recent backup taken *before* the {{--to-datetime}} 
> timestamp and execute the {{restore}} command for that table. This will apply 
> both full and incremental snapshots.
>  # Determine the WAL (Write-Ahead Log) replay duration, covering logs 
> generated after the last backup and before {{{}--to-datetime{}}}.
>  # Invoke *WALPlayer* with the {{{}backupdir{}}}, {{{}from-time{}}}, and 
> {{to-time}} parameters to perform WAL replay.
> h4. WAL Replay Details:
>  * The eligible day directories will be provided as a {*}comma-separated 
> list{*}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
