[
https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338635#comment-16338635
]
Vladimir Rodionov commented on HBASE-17852:
-------------------------------------------
[~appy] wrote:
{quote}
Off the top of my head, I think the main areas to touch upon are:
- Make backups concurrent
- Use procedure framework: Long-standing request. The procv2 framework has
features like locking, queuing operations, etc. Replication is already moving
to it. I don't see a reason why backup can't too.
- Can't use CP hooks for incremental backup. Backup should/will become a
first-class feature - more important and critical than Coprocessors.
- There should be some basic access control, if only limiting everything to
ADMIN (like RS group recently did in HBASE-19483)
{quote}
OK,
h4. Concurrent backups
It is doable, but ...
# It will require transaction management support, which complicates the
implementation a lot. We will need to provide full isolation of operations and
complex conflict resolution on commit. And what about rollback?
# It complicates testing a lot as well. Imagine all the possible collisions
between create, merge, and delete sessions.
What I suggest is a slightly different approach:
# Make restore operations concurrent
# Implement fair queuing for *create-merge-delete* sessions
# *create-merge-delete* executions will be serialized (run one by one), but
from the user's point of view they will appear to run in parallel.
YES/NO
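A minimal sketch of the queuing idea above, not an actual HBase implementation. All names here (BackupSessionScheduler, submitExclusive, submitRestore) are hypothetical. It assumes that plain FIFO order is an acceptable stand-in for "fair" queuing and that restores do not modify backup metadata, so they can safely run on their own pool:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; none of these names exist in HBase.
final class BackupSessionScheduler implements AutoCloseable {
    // A single worker thread drains a FIFO queue, so create/merge/delete
    // sessions execute one at a time, in submission order.
    private final ExecutorService exclusive = Executors.newSingleThreadExecutor();

    // Assumption: restores do not write backup metadata, so they may overlap
    // with each other and with the exclusive sessions.
    private final ExecutorService restores;

    BackupSessionScheduler(int restoreThreads) {
        this.restores = Executors.newFixedThreadPool(restoreThreads);
    }

    // Queue a create, merge, or delete session. The caller gets a Future
    // immediately, which is what makes the sessions look parallel to users.
    <T> Future<T> submitExclusive(Callable<T> session) {
        return exclusive.submit(session);
    }

    // Queue a restore; up to restoreThreads restores run concurrently.
    <T> Future<T> submitRestore(Callable<T> restore) {
        return restores.submit(restore);
    }

    @Override
    public void close() {
        // Already-queued work still runs; new submissions are rejected.
        exclusive.shutdown();
        restores.shutdown();
    }
}
```

With this shape, a client that submits create, merge, and delete back to back does not block on submission; the sessions simply complete in that order.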
h4. Use procedure framework
Short answer - no. I will wait until procv2 becomes more mature and robust. I
do not want to build a new feature on the foundation of another new feature.
Too risky, in my opinion. NO
h4. Can't use CP hooks for incremental backup
Currently backup lives in a separate module, and we would like to keep it
there. There is no need for tight integration of HBase core and backup, and
therefore CP is our only option here. NO
h4. Access control
Currently, only ADMIN can run backup/restore/delete/merge operations, but we
do not enforce this explicitly, so we should probably check access rights
*before* starting a critical operation. YES.
[~appy], [~elserj] - comments?
> Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental
> backup)
> ------------------------------------------------------------------------------------
>
> Key: HBASE-17852
> URL: https://issues.apache.org/jira/browse/HBASE-17852
> Project: HBase
> Issue Type: Sub-task
> Reporter: Vladimir Rodionov
> Assignee: Vladimir Rodionov
> Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-17852-v10.patch, screenshot-1.png
>
>
> The rollback-via-snapshot design approach implemented in this ticket:
> # Before a backup create/delete/merge starts, we take a snapshot of the backup
> meta-table (backup system table). This procedure is lightweight because the
> meta-table is small and usually fits in a single region.
> # When an operation fails on the server side, we handle the failure by
> cleaning up partial data in the backup destination, followed by restoring the
> backup meta-table from the snapshot.
> # When an operation fails on the client side (abnormal termination, for
> example), the next time the user tries create/merge/delete, they will see an
> error message saying that the system is in an inconsistent state and repair
> is required; they will need to run the backup repair tool.
> # To avoid multiple writers to the backup system table (the backup client and
> BackupObservers), we introduce a small table used ONLY to keep the list of
> bulk loaded files. All backup observers will work only with this new table.
> The reason: in case of a failure during backup create/delete/merge/restore,
> when the system performs an automatic rollback, some data written by backup
> observers during the failed operation may be lost. This is what we try to
> avoid.
> # The second table keeps only bulk-load-related references. We do not care
> about the consistency of this table, because bulk load is an idempotent
> operation and can be repeated after a failure. Partially written data in the
> second table does not affect the BackupHFileCleaner plugin, because this data
> (the list of bulk loaded files) corresponds to files which have not yet been
> loaded successfully and hence are not visible to the system.
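For illustration only, the server-side part of the rollback-via-snapshot flow quoted above might be sketched as follows. This is not the code in the attached patch. An in-memory map stands in for the backup meta-table, a map copy stands in for the table snapshot, and all names (SnapshotRollbackSketch, Step, runWithRollback) are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; a Map stands in for the backup meta-table.
final class SnapshotRollbackSketch {
    // One backup operation (create, delete, or merge) that mutates the meta-table.
    interface Step {
        void apply(Map<String, String> metaTable) throws Exception;
    }

    // Returns true if the operation succeeded, false if it failed and the
    // meta-table was rolled back to its state before the operation.
    static boolean runWithRollback(Map<String, String> metaTable, Step operation,
                                   Runnable cleanupDestination) {
        // 1. Snapshot before the operation starts (assumed cheap: the table is small).
        Map<String, String> snapshot = new HashMap<>(metaTable);
        try {
            operation.apply(metaTable);
            return true;
        } catch (Exception e) {
            // 2. On failure, clean up partial data in the backup destination...
            cleanupDestination.run();
            // ...then restore the meta-table from the snapshot.
            metaTable.clear();
            metaTable.putAll(snapshot);
            return false;
        }
    }
}
```

The client-side failure case is deliberately left out: there the process dies before any catch block runs, which is why the description calls for a separate repair tool.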