[jira] [Updated] (HBASE-15227) HBase Backup Phase 3: Fault tolerance (client/server) support

Vladimir Rodionov (JIRA) Fri, 21 Apr 2017 10:49:19 -0700

     [ https://issues.apache.org/jira/browse/HBASE-15227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Vladimir Rodionov updated HBASE-15227:
--------------------------------------
    Description: 
The system must be tolerant of faults:

# Backup operations MUST be atomic (no partial completion state in the backup 
system table)
# The process must detect any type of failure that can result in data loss 
(partial backup or partial restore)
# The system table state must be properly restored and cleaned up after a 
failure
# An additional utility must be implemented to repair the backup system table 
and clean up the corresponding file system data

h3. Backup

h4. General FT framework implementation 

Before the actual backup operation starts, a snapshot of the backup system 
table is taken and the system table is updated with the *ACTIVE_SNAPSHOT* flag. 
The flag is removed upon backup completion.

In case of *any* server-side failure, the client catches the errors/exceptions 
and handles them as follows:

# Cleans up the backup destination (removes partial backup data)
# Cleans up any temporary data
# Deletes any active snapshots of the tables being backed up (tables are 
snapshotted during a full backup)
# Restores the backup system table from its snapshot
# Deletes the backup system table snapshot (the snapshot name is read from the 
backup system table beforehand)
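
The snapshot-and-rollback flow above can be sketched as a small in-memory model. This is only an illustration, not the HBase implementation: the dicts standing in for the system table, the destination of the current session and the temporary data, the snapshot name, and the names {{run_backup}} and {{BackupFailed}} are all assumptions made for the example.

```python
import copy


class BackupFailed(Exception):
    """Raised after a failed backup session has been rolled back."""


def run_backup(system_table, destination, tmp_dir, do_backup):
    """Run do_backup() with snapshot-based rollback of system_table.

    system_table, destination and tmp_dir are plain dicts that model the
    backup system table, this session's destination and temporary data.
    """
    # Snapshot the system table first, then mark it with the flag.
    snapshot = copy.deepcopy(system_table)
    system_table["ACTIVE_SNAPSHOT"] = "snapshot_backup_system"  # hypothetical name
    table_snapshots = {}  # snapshots of user tables taken during a full backup
    try:
        do_backup(system_table, destination, tmp_dir, table_snapshots)
    except Exception as exc:
        destination.clear()          # 1. remove partial backup data
        tmp_dir.clear()              # 2. remove temporary data
        table_snapshots.clear()      # 3. delete snapshots of backed-up tables
        system_table.clear()         # 4. restore the system table from snapshot
        system_table.update(snapshot)
        snapshot = None              # 5. delete the system table snapshot
        raise BackupFailed(str(exc)) from exc
    # Success: remove the flag so that later operations may proceed.
    del system_table["ACTIVE_SNAPSHOT"]
```

Because the snapshot is taken before the flag is written, restoring from it also clears the flag, which is what leaves the system table with no partial completion state.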

In case of *any* client-side failure:

Before any backup or restore operation runs, we check the backup system table 
for the *ACTIVE_SNAPSHOT* flag. If the flag is present, a previous backup did 
not finish its cleanup, so the operation aborts with a message that the backup 
repair tool (see below) must be run.
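
A minimal sketch of this pre-flight check, assuming the backup system table is modeled as a dict and the flag as a key in it. The exception name and the message text are hypothetical.

```python
class RepairRequired(Exception):
    """Raised when a previous backup session left the flag behind."""


def check_no_active_snapshot(system_table):
    """Abort if the ACTIVE_SNAPSHOT flag is still set in the system table."""
    if "ACTIVE_SNAPSHOT" in system_table:
        # Message text is illustrative only.
        raise RepairRequired(
            "Previous backup session did not complete; run 'backup repair' first"
        )
```

The check would run at the start of every backup and restore operation, before any state is modified.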

h4. Backup repair tool

The command line tool *backup repair* executes the following steps:

# Reads information about the last failed backup session
# Cleans up the backup destination (removes partial backup data)
# Cleans up any temporary data
# Deletes any active snapshots of the tables being backed up (tables are 
snapshotted during a full backup)
# Restores the backup system table from its snapshot
# Deletes the backup system table snapshot (the snapshot name is read from the 
backup system table beforehand)
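
The repair steps can be sketched on an in-memory model as follows. This is not the actual tool: the {{failed_session}} record, its {{backup_id}} and {{table_snapshots}} fields, and the dict used as a snapshot store are assumptions made for the example.

```python
def repair(system_table, snapshots, destination, tmp_dir):
    """Roll back the last failed backup session; return True if work was done.

    snapshots models a snapshot store (name -> saved state) and destination
    maps a backup id to the data written for that session.
    """
    # Read the snapshot name first: restoring the table will erase it.
    snapshot_name = system_table.get("ACTIVE_SNAPSHOT")
    if snapshot_name is None:
        return False  # no failed session to repair

    session = system_table.get("failed_session", {})   # 1. read session info
    destination.pop(session.get("backup_id"), None)    # 2. remove partial data
    tmp_dir.clear()                                    # 3. remove temporary data
    for name in session.get("table_snapshots", []):    # 4. delete table snapshots
        snapshots.pop(name, None)

    saved_state = snapshots.pop(snapshot_name)         # 6. delete the snapshot...
    system_table.clear()                               # 5. ...after restoring from it
    system_table.update(saved_state)
    return True
```

Running the sketch a second time is a no-op, because the restored system table no longer carries the flag.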

h4. Detection of a partial loss of data

h5. Full backup  

Export snapshot operation (?).

We count files and check their sizes before and after the DistCp run.
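
A minimal sketch of such a check, assuming a local directory tree stands in for the source and destination file systems. DistCp itself is not invoked here, and the helper names {{manifest}} and {{verify_copy}} are hypothetical.

```python
import os


def manifest(root):
    """Map each file path (relative to root) to its size in bytes."""
    sizes = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            sizes[os.path.relpath(path, root)] = os.path.getsize(path)
    return sizes


def verify_copy(source_manifest, dest_root):
    """Return a list of problems; an empty list means counts and sizes match."""
    copied = manifest(dest_root)
    problems = []
    if len(copied) != len(source_manifest):
        problems.append(f"file count {len(copied)} != {len(source_manifest)}")
    for rel, size in source_manifest.items():
        if rel not in copied:
            problems.append(f"missing {rel}")
        elif copied[rel] != size:
            problems.append(f"size mismatch for {rel}")
    return problems
```

The source manifest is recorded before the copy starts and compared against the destination afterwards; any reported problem would be treated as a partial backup.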

h5. Incremental backup 

Conversion of WALs to HFiles, when a WAL file is moved from the active to the 
archive directory. Code is already in place to handle this situation.

During the DistCp run (same as above).

h3. Restore

This operation does not modify the backup system table and is idempotent. No 
special FT is required.


 
     

  was:
The system must be tolerant of faults:

# Backup operations MUST be atomic (no partial completion state in the backup 
system table)
# The process must detect any type of failure that can result in data loss 
(partial backup or partial restore)
# The system table state must be properly restored and cleaned up after a 
failure
# An additional utility must be implemented to repair the backup system table 
and clean up the corresponding file system data


> HBase Backup Phase 3: Fault tolerance (client/server) support
> -------------------------------------------------------------
>
>                 Key: HBASE-15227
>                 URL: https://issues.apache.org/jira/browse/HBASE-15227
>             Project: HBase
>          Issue Type: Task
>            Reporter: Vladimir Rodionov
>            Assignee: Vladimir Rodionov
>            Priority: Blocker
>              Labels: backup
>             Fix For: 2.0.0
>
>         Attachments: HBASE-15227-v3.patch, HBASE-15277-v1.patch



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
