[
https://issues.apache.org/jira/browse/HBASE-15227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Rodionov updated HBASE-15227:
--------------------------------------
Description:
The system must be fault tolerant:
# Backup operations MUST be atomic (no partial completion state in the backup
system table)
# The process must detect any type of failure that can result in data loss
(partial backup or partial restore)
# The system table state must be properly restored and cleaned up in case of a
failure
# An additional utility must be implemented to repair the backup system table
and clean up the corresponding file system data
h3. Backup
h4. General FT framework implementation
Before the actual backup operation starts, a snapshot of the backup system
table is taken and the system table is updated with the *ACTIVE_SNAPSHOT* flag.
The flag is removed upon backup completion.
In case of *any* server-side failure, the client catches the errors/exceptions
and handles them:
# Cleans up the backup destination (removes partial backup data)
# Cleans up any temporary data
# Deletes any active snapshots of the tables being backed up (during a full
backup we snapshot the tables)
# Restores the backup system table from its snapshot
# Deletes the backup system table snapshot (the snapshot name is read from the
backup system table beforehand)
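The snapshot-then-rollback flow described above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the HBase implementation: {{InMemoryBackupSystem}}, {{run_backup}}, {{rollback}}, and the snapshot naming scheme are all hypothetical names introduced here.

```python
import copy


class InMemoryBackupSystem:
    """Hypothetical in-memory stand-in for the backup system table and storage."""

    def __init__(self):
        self.system_table = {"sessions": []}  # committed backup metadata
        self.system_snapshot = None           # snapshot of the system table
        self.destination = {}                 # backup id -> files written so far
        self.temp_data = []                   # temporary working data
        self.table_snapshots = []             # snapshots of tables being backed up


def run_backup(system, backup_id, tables, copy_step):
    """Run one backup session; undo every side effect if copy_step raises."""
    # Snapshot the system table first, then set the flag on the live table,
    # so the snapshot itself does not contain the flag.
    system.system_snapshot = copy.deepcopy(system.system_table)
    system.system_table["ACTIVE_SNAPSHOT"] = f"snapshot_system_{backup_id}"
    try:
        system.table_snapshots.extend(f"snapshot_{backup_id}_{t}" for t in tables)
        system.system_table["sessions"].append({"id": backup_id, "state": "RUNNING"})
        copy_step(system, backup_id)
        system.system_table["sessions"][-1]["state"] = "COMPLETE"
    except Exception:
        rollback(system, backup_id)
        raise
    # Success: remove the flag. Dropping the snapshot here is an assumption.
    del system.system_table["ACTIVE_SNAPSHOT"]
    system.system_snapshot = None


def rollback(system, backup_id):
    """Apply the five cleanup steps in the order listed above."""
    system.destination.pop(backup_id, None)                 # 1. partial backup data
    system.temp_data.clear()                                # 2. temporary data
    prefix = f"snapshot_{backup_id}_"
    system.table_snapshots = [s for s in system.table_snapshots
                              if not s.startswith(prefix)]  # 3. table snapshots
    system.system_table = system.system_snapshot            # 4. restore system table
    system.system_snapshot = None                           # 5. delete its snapshot
```

Because the snapshot is taken before the flag is set, restoring the system table in step 4 also clears *ACTIVE_SNAPSHOT* in this sketch.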
In case of *any* client-side failure:
Before any backup or restore operation runs, we check the backup system table
for the *ACTIVE_SNAPSHOT* flag. If the flag is present, the operation aborts
with a message that the backup repair tool (see below) must be run.
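A minimal sketch of this pre-operation check, assuming the system table can be read as a plain dictionary. {{RepairRequiredError}}, {{check_no_active_snapshot}}, and {{start_operation}} are hypothetical names used only for illustration.

```python
class RepairRequiredError(RuntimeError):
    """Raised when a previous session left the ACTIVE_SNAPSHOT flag behind."""


def check_no_active_snapshot(system_table):
    """Abort if the flag from an unfinished backup session is still present."""
    if "ACTIVE_SNAPSHOT" in system_table:
        raise RepairRequiredError(
            "A previous backup session did not complete; "
            "run 'backup repair' before starting a new operation."
        )


def start_operation(system_table, operation):
    """Run a backup or restore callable only after the flag check passes."""
    check_no_active_snapshot(system_table)
    return operation()
```

The check relies on the flag being persisted in the system table, so it still fires after the client process that set it has died.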
h4. Backup repair tool
The command-line tool *backup repair* executes the following steps:
# Reads information about the last failed backup session
# Cleans up the backup destination (removes partial backup data)
# Cleans up any temporary data
# Deletes any active snapshots of the tables being backed up (during a full
backup we snapshot the tables)
# Restores the backup system table from its snapshot
# Deletes the backup system table snapshot (the snapshot name is read from the
backup system table beforehand)
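The repair steps differ from client-side error handling mainly in that everything must be recovered from persisted state. A rough sketch under that assumption follows; the {{cluster}} dictionary layout, the {{repair}} function, and the per-session {{table_snapshots}} field are hypothetical and do not reflect the actual system table schema.

```python
def repair(cluster):
    """Undo the last failed backup session using only persisted state.

    Returns True if a repair was performed, False if nothing needed repair.
    """
    table = cluster["system_table"]
    # The snapshot name must be read before the table is restored,
    # because the restored table no longer carries the flag.
    snapshot_name = table.get("ACTIVE_SNAPSHOT")
    if snapshot_name is None:
        return False

    session = table["sessions"][-1]                 # 1. last failed session
    cluster["destination"].pop(session["id"], None) # 2. partial backup data
    cluster["temp"].clear()                         # 3. temporary data
    for name in session["table_snapshots"]:         # 4. table snapshots
        cluster["snapshots"].pop(name, None)
    cluster["system_table"] = cluster["snapshots"][snapshot_name]  # 5. restore
    del cluster["snapshots"][snapshot_name]         # 6. delete system snapshot
    return True
```

In this sketch a second run finds no flag and does nothing, so running the tool twice is harmless.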
h4. Detection of a partial loss of data
h5. Full backup
Export snapshot operation (?).
We count the files and check their sizes before and after the DistCp run.
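The count-and-size check can be illustrated with a local-filesystem sketch. It is an assumption-laden stand-in: a real DistCp run copies between HDFS (or other Hadoop-compatible) file systems, and {{manifest}} and {{verify_copy}} are hypothetical helpers, not part of DistCp or HBase.

```python
from pathlib import Path


def manifest(root):
    """Map each file's path (relative to root) to its size in bytes."""
    root = Path(root)
    return {
        str(p.relative_to(root)): p.stat().st_size
        for p in root.rglob("*")
        if p.is_file()
    }


def verify_copy(source_manifest, dest_root):
    """Compare the destination against a manifest taken before the copy.

    Returns a list of problems; an empty list means counts and sizes match.
    """
    dest = manifest(dest_root)
    problems = []
    if len(dest) != len(source_manifest):
        problems.append(
            f"file count mismatch: expected {len(source_manifest)}, found {len(dest)}"
        )
    for name, size in source_manifest.items():
        if name not in dest:
            problems.append(f"missing file: {name}")
        elif dest[name] != size:
            problems.append(
                f"size mismatch for {name}: expected {size}, found {dest[name]}"
            )
    return problems
```

The manifest is taken from the source before the copy starts and checked against the destination afterwards; a non-empty result would mark the backup as partial. Matching counts and sizes do not prove the contents are identical, only that no file is missing or truncated.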
h5. Incremental backup
Conversion of WALs to HFiles, when a WAL file is moved from the active to the
archive directory. The code is in place to handle this situation.
During the DistCp run (same as above).
h3. Restore
This operation does not modify the backup system table and is idempotent. No
special FT handling is required.
was:
System must be tolerant to faults:
# Backup operations MUST be atomic (no partial completion state in the backup
system table)
# Process must detect any type of failures which can result in a data loss
(partial backup or partial restore)
# Proper system table state restore and cleanup must be done in case of a
failure
# Additional utility to repair backup system table and corresponding file
system cleanup must be implemented
> HBase Backup Phase 3: Fault tolerance (client/server) support
> -------------------------------------------------------------
>
> Key: HBASE-15227
> URL: https://issues.apache.org/jira/browse/HBASE-15227
> Project: HBase
> Issue Type: Task
> Reporter: Vladimir Rodionov
> Assignee: Vladimir Rodionov
> Priority: Blocker
> Labels: backup
> Fix For: 2.0.0
>
> Attachments: HBASE-15227-v3.patch, HBASE-15277-v1.patch