[ 
https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259504#comment-16259504
 ] 

Josh Elser commented on HBASE-17852:
------------------------------------

bq. I am not asking for any particular implementation, to be clear. I'm just 
trying to understand and am having trouble digesting full restore of a meta 
table whatever the size or traffic on error. It strikes me as whack (You seem 
to at least agree it 'overkill')

Got it. To clarify my previous message, by "overkill" I only meant "non-ideal": 
there is likely a more complicated solution that could accomplish the same net 
effect with less computation and time. I didn't mean to say that I believe 
using a snapshot and table-restore is invalid or wrong. My gut reaction is that 
the number of backups which would need to be retained in the system (e.g. rows 
in the hbase backup "system" table) would have to be quite large (many 
thousands to millions) for the table to even grow beyond a single region. As 
such, the snapshot restore isn't much more than grabbing the write lock and 
replacing one data file and some Region metadata. This is on my list today to 
investigate and confirm.

To try to move the conversation forward, I tend to agree with Vlad that I don't 
see an inherent problem with the rollback-via-snapshot implementation. 
Architecturally, Vlad is using the snapshot feature exactly how it was intended 
to be used (shallow copy and restore of a table).

bq. the idea to offline a system table and then restore from a snapshot on 
error with clients 'advised' to stop writing as some-sort of 2PC

Let's revisit this: in the parent JIRA issue, Vlad outlined two cases. 1) 
Recover from a "server-side" failure and 2) recover from a client-side failure 
(which, probably, was implicitly meant to include unhandled server-side failure 
conditions too).

For #1, clients don't need to do anything special (specifically mentioned on 
the parent issue). Mutual exclusion is already built in to manage the 
serialized state in the backup "system" table. So, we're just looking at the 
cost of these steps. Offline+snapshot+online should be one of these rock-solid 
features of the system.
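
To make the flow in #1 concrete, here is a minimal in-memory sketch of 
rollback-via-snapshot under mutual exclusion. It is only a model: the class and 
method names ({{SnapshotRollbackSketch}}, {{runWithRollback}}, {{BackupStep}}) 
are hypothetical, a plain map stands in for the backup "system" table, and a 
map copy stands in for the (shallow) HBase snapshot. It is not the code in the 
patch and it does not use the HBase API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** In-memory model only; a Map stands in for the backup "system" table. */
final class SnapshotRollbackSketch {
    // Hypothetical stand-in for the rows of the backup system table.
    private final Map<String, String> rows = new HashMap<>();
    // Models the mutual exclusion: only one thread may run a backup operation.
    private final ReentrantLock exclusiveOp = new ReentrantLock();

    /** One backup operation that mutates the table and may fail part-way. */
    interface BackupStep {
        void apply(Map<String, String> table) throws Exception;
    }

    /** Runs the step; on failure restores the pre-operation state. Returns true on success. */
    boolean runWithRollback(BackupStep step) {
        if (!exclusiveOp.tryLock()) {
            throw new IllegalStateException("another backup operation is in progress");
        }
        try {
            // "Snapshot": capture the table state before the operation starts.
            Map<String, String> snapshot = new HashMap<>(rows);
            try {
                step.apply(rows);
                return true;
            } catch (Exception e) {
                // "Restore": discard any partial writes made by the failed step.
                rows.clear();
                rows.putAll(snapshot);
                return false;
            }
        } finally {
            exclusiveOp.unlock();
        }
    }

    /** Returns a copy of the current table contents. */
    Map<String, String> view() {
        return new HashMap<>(rows);
    }
}
```

Because the lock is held for the whole operation, the restore never races with 
another writer in this model; that is the property the mutual exclusion on the 
backup "system" table is relied on for above.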

For #2, we're in the situation that you outline. As for the concerns you raised 
about "coordination" (the special handshake, to use another metaphor), this 
seems mitigate-able via the return code of the {{hbase backup}} command and a 
prominent error message in this case. I don't know if either presently exists 
(Could you comment, [~vrodionov]?).
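
As a rough illustration of what I mean by "return code plus prominent error 
message", something along these lines would do. Everything here is 
hypothetical: the {{BackupExitSketch}} class, the {{Outcome}} values, the exit 
codes and the message text are made up for the example and do not describe 
what {{hbase backup}} does today.

```java
import java.io.PrintStream;

/** Hypothetical mapping from backup outcome to exit code; not actual hbase backup behavior. */
final class BackupExitSketch {
    enum Outcome {
        SUCCESS(0, "Backup completed."),
        FAILED_ROLLED_BACK(1, "Backup failed; the backup system table was restored."),
        FAILED_NOT_ROLLED_BACK(2,
            "BACKUP FAILED AND WAS NOT ROLLED BACK. Stop issuing backup commands "
            + "until an operator has restored the backup system table.");

        final int exitCode;
        final String message;

        Outcome(int exitCode, String message) {
            this.exitCode = exitCode;
            this.message = message;
        }
    }

    /** Prints an error line for any non-success outcome and returns the exit code. */
    static int report(Outcome outcome, PrintStream err) {
        if (outcome != Outcome.SUCCESS) {
            err.println("ERROR: " + outcome.message);
        }
        return outcome.exitCode;
    }
}
```

A wrapper script or operator can then branch on the non-zero code instead of 
relying on clients being "advised" out of band to stop writing.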

Both of these are predicated on the mutual exclusion of multiple clients at a 
higher level. Obviously, a finer-grained exclusion strategy is desirable for 
multiple reasons, but, given my current understanding, I don't see any 
fundamental problem with this approach.

> Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental 
> backup)
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-17852
>                 URL: https://issues.apache.org/jira/browse/HBASE-17852
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Vladimir Rodionov
>            Assignee: Vladimir Rodionov
>             Fix For: 2.0.0-beta-1
>
>         Attachments: HBASE-17852-v1.patch, HBASE-17852-v2.patch, 
> HBASE-17852-v3.patch, HBASE-17852-v4.patch, HBASE-17852-v5.patch, 
> HBASE-17852-v6.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
