[
https://issues.apache.org/jira/browse/HBASE-29133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945345#comment-17945345
]
Vinayak Hegde edited comment on HBASE-29133 at 4/17/25 11:46 AM:
-----------------------------------------------------------------
h2. Design
h3. Problem Statement
In short, we need to support {*}Point-in-Time Recovery (PITR){*}: the user provides
a set of tables and a target timestamp, and the system restores those
tables to that point in time.
This recovery uses a combination of *full*, *incremental*,
and *continuous* backups available in the backup locations.
The recovery process involves:
# Restoring the most recent valid full backup.
# Applying all relevant incremental backups.
# Replaying Write-Ahead Logs (WALs) up to the requested point in time.
----
h3. Terminology
* {*}replicationCheckpoint (rc){*}: The latest timestamp before which all data
is guaranteed to be successfully backed up via continuous replication.
* {*}maxAllowedPITRTime (mapt){*}: The maximum lookback duration (e.g., 30
days) allowed for PITR. This is a cluster-level configuration. Thus, PITR is
allowed from {{currentTime - mapt}} to {{currentTime}}.
* {*}requestedToTime (rt){*}: The timestamp to which the user wants to restore
the data.
* {*}currentTime (ct){*}: The time at which the PITR request is being
processed.
* {*}fs{*}: Backup start time (when the backup begins).
* {*}fm{*}: Logical snapshot time (when the snapshot is taken; represents the
state of data).
* {*}fe{*}: Backup end time.
h3. Limitation
We do not have a reliable way to determine the exact {{fm}} from current
metadata. Hence, we conservatively use {{fe}} when checking that a backup
completed before {{rt}}, and {{fs}} as the starting point for WAL replay.
----
h2. Validations
Several validations must be performed before proceeding:
h4. 1. Continuous Backup Enabled
If continuous backup is not enabled for one or more of the requested tables,
the PITR request should be rejected.
h4. 2. Requested Time Within PITR Window
The requested point-in-time ({{rt}}) must lie within a valid PITR window,
defined as:
* {{rt >= currentTime - maxAllowedPITRTime}}
* {{rt <= replicationCheckpoint}}
{quote}{*}Note{*}: While the PITR window theoretically ends at
{{currentTime}}, replication may lag behind. For example, with a 1-hour
lag, valid recovery is only possible up to {{currentTime - 1 hour}}.
{quote}
Refer to HBASE-29220 for more details about the replication checkpoint.
So the valid range is:
{{ct - mapt ≤ rt ≤ rc}}
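The window check from the two conditions above ({{rt >= currentTime - maxAllowedPITRTime}} and {{rt <= replicationCheckpoint}}) can be sketched as follows. This is only an illustration: the class name {{PitrWindow}} and the method {{isWithinWindow}} are hypothetical and not part of HBase, and how {{rc}} and {{mapt}} are actually obtained is out of scope here.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper; class and method names are illustrative only. */
final class PitrWindow {
    private PitrWindow() {}

    /**
     * Returns true when rt lies in the closed range [ct - mapt, rc].
     *
     * @param rt   requestedToTime
     * @param ct   currentTime (when the PITR request is processed)
     * @param mapt maxAllowedPITRTime, the maximum lookback duration
     * @param rc   replicationCheckpoint
     */
    static boolean isWithinWindow(Instant rt, Instant ct, Duration mapt, Instant rc) {
        Instant earliest = ct.minus(mapt);
        // Both bounds are inclusive, matching the >= and <= conditions.
        return !rt.isBefore(earliest) && !rt.isAfter(rc);
    }
}
```

With a 30-day {{mapt}} and a 1-hour replication lag, a request for 30 minutes ago would be rejected by this check, while a request for yesterday would pass.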
h4. 3. Backup and WAL Coverage
We must ensure:
* A valid base backup (full/incremental) exists before {{rt}}.
* WAL files are available to bridge the gap between that backup and {{rt}}.
Steps for validation (per table):
# Traverse backup records in reverse chronological order.
# For each backup:
** Does it include the target table?
** Was it completed before {{rt}}? (i.e., {{fe <= rt}})
** Do WALs cover from {{fs}} to {{rt}}?
** Are required ancestor backups available?
# If no such backup is found, throw an error and exit.
If all checks pass:
* The requested {{rt}} is valid.
* A valid base backup and WALs are available for each table.
If the user requested {_}only validation{_}, return success at this point.
h3. Determining a Valid Backup
To consider a backup valid for PITR:
# *Includes the table* being restored.
# {*}Completed before {{requestedToTime}}{*}, i.e., {{fe <= rt}}.
** Since {{fm}} is not recorded, we conservatively use {{fe}} for validation.
# {*}WALs are available from {{fs}} to {{rt}}{*}.
** If WALs exist from {{fs}} onward, they cover the unknown {{fm}} as well.
# *All required ancestor backups* are available.
** Use existing mechanisms from the standard restore workflow to verify this.
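The four criteria above, combined with the reverse-chronological traversal, can be sketched as below. This is a minimal illustration under stated assumptions: {{BackupRecord}}, {{WalCoverage}}, and {{findBaseBackup}} are hypothetical names rather than HBase classes, the ancestor check is reduced to a precomputed flag, and all timestamps are epoch milliseconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Illustrative only: these types are hypothetical, not HBase classes. */
final class PitrBackupSelector {
    private PitrBackupSelector() {}

    /** Minimal stand-in for backup metadata; fs and fe are epoch milliseconds. */
    static final class BackupRecord {
        final String id;
        final Set<String> tables;
        final long fs;
        final long fe;
        final boolean ancestorsAvailable;

        BackupRecord(String id, Set<String> tables, long fs, long fe,
                     boolean ancestorsAvailable) {
            this.id = id;
            this.tables = tables;
            this.fs = fs;
            this.fe = fe;
            this.ancestorsAvailable = ancestorsAvailable;
        }
    }

    /** Hypothetical lookup: do backed-up WALs cover [from, to] for the table? */
    interface WalCoverage {
        boolean covers(String table, long from, long to);
    }

    /** Returns the newest backup that satisfies all four criteria, if any. */
    static Optional<BackupRecord> findBaseBackup(String table, long rt,
            List<BackupRecord> backups, WalCoverage wals) {
        List<BackupRecord> newestFirst = new ArrayList<>(backups);
        // Reverse chronological order by backup end time.
        newestFirst.sort((a, b) -> Long.compare(b.fe, a.fe));
        for (BackupRecord b : newestFirst) {
            boolean valid = b.tables.contains(table)   // includes the table
                && b.fe <= rt                          // completed before rt
                && wals.covers(table, b.fs, rt)        // WALs from fs to rt
                && b.ancestorsAvailable;               // ancestors present
            if (valid) {
                return Optional.of(b);
            }
        }
        return Optional.empty();
    }
}
```

An empty result corresponds to the "throw an error and exit" case in the validation steps.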
----
h3. Execution
Once a valid base backup is identified and validated:
# *Restore the backup* using the existing restore mechanism.
# *Replay WALs* using the {{WALPlayer}} MapReduce job to bring the table to
the requested point in time.
{quote}Since {{fm}} is unknown, we conservatively replay WALs from {{fs}} to
{{requestedToTime}}.
{quote}
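As a rough sketch of the replay step, the arguments for a {{WALPlayer}} run covering {{fs}} to {{rt}} could be assembled like this. The class {{WalReplayPlan}} and its method are hypothetical. The {{wal.start.time}} and {{wal.end.time}} property names are taken from the WALPlayer usage text as I understand it and should be verified against the HBase version in use; the WAL input directory and table list are placeholders supplied by the caller.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that only builds an argument list; it runs nothing. */
final class WalReplayPlan {
    private WalReplayPlan() {}

    /**
     * Builds WALPlayer-style arguments for replaying WALs from fs to rt.
     *
     * @param walInputDirs comma-separated WAL input directories (placeholder)
     * @param tables       comma-separated table names to replay
     * @param fs           backup start time in epoch milliseconds
     * @param rt           requestedToTime in epoch milliseconds
     */
    static List<String> walPlayerArgs(String walInputDirs, String tables,
                                      long fs, long rt) {
        if (fs > rt) {
            throw new IllegalArgumentException("fs must not be after rt");
        }
        List<String> args = new ArrayList<>();
        // Assumed property names; check them against your WALPlayer version.
        args.add("-Dwal.start.time=" + fs);
        args.add("-Dwal.end.time=" + rt);
        args.add(walInputDirs);
        args.add(tables);
        return args;
    }
}
```

Starting at {{fs}} rather than {{fe}} may replay some edits already contained in the backup, which is the conservative choice described in the note above.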
> Implement "pitr" Command for Point-in-Time Recovery/Restore
> -----------------------------------------------------------
>
> Key: HBASE-29133
> URL: https://issues.apache.org/jira/browse/HBASE-29133
> Project: HBase
> Issue Type: Task
> Components: backup&restore
> Affects Versions: 2.6.0, 3.0.0-alpha-4
> Reporter: Vinayak Hegde
> Assignee: Vinayak Hegde
> Priority: Major
> Labels: pull-request-available
>
> h4. New "pitr" Command
> {code:bash}
> hbase pitr
> [-t <table_name[,table_name]>]
> [-s <backup_set_name>]
> [-q <name>]
> [-c]
> [-m <target_tables>]
> [-o]
> [--to-datetime <end_time>]
> {code}
> h4. Process for Each Table:
> # Identify the most recent backup taken *before* the {{--to-datetime}}
> timestamp and execute the {{restore}} command for that table. This will apply
> both full and incremental snapshots.
> # Determine the WAL (Write-Ahead Log) replay duration, covering logs
> generated after the last backup and before {{--to-datetime}}.
> # Invoke *WALPlayer* with the {{backupdir}}, {{from-time}}, and
> {{to-time}} parameters to perform WAL replay.
> h4. WAL Replay Details:
> * The eligible day directories will be provided as a {*}comma-separated
> list{*}.