[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259504#comment-16259504 ]
Josh Elser commented on HBASE-17852:
------------------------------------

bq. I am not asking for any particular implementation, to be clear. I'm just trying to understand and am having trouble digesting full restore of a meta table whatever the size or traffic on error. It strikes me as whack (You seem to at least agree it 'overkill')

Got it. To clarify my previous message, by "overkill" I only mean "non-ideal". As in, there is likely a more complicated solution that could accomplish the same net effect with less computation and time. I didn't mean to say that I believe using a snapshot and table restore is invalid or wrong.

My gut reaction is that the number of backups which would need to be retained in the system (e.g. rows in the hbase backup "system" table) would have to be quite large (many thousands to millions) for the table to even grow beyond a single region. As such, the snapshot restore isn't much more than grabbing the write lock and replacing one data file and some Region metadata. This is on my list today to investigate and confirm.

To try to move the conversation forward, I tend to agree with Vlad that I don't see an inherent problem with the rollback-via-snapshot implementation. Architecturally, Vlad is using the snapshot feature exactly how it was intended to be used (shallow copy and restore of a table).

bq. the idea to offline a system table and then restore from a snapshot on error with clients 'advised' to stop writing as some-sort of 2PC

Let's revisit this again: in the parent JIRA issue, Vlad outlined two cases: 1) recover from a "server-side" failure, and 2) recover from a client-side failure (which probably, implicitly, was meant to include unhandled server-side failure conditions too).

For #1, clients don't need to do anything special (specifically mentioned on the parent issue). Mutual exclusion is already built in to manage the serialized state in the backup "system" table. So, we're just looking at the cost of these steps.
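
To make the rollback-via-snapshot idea concrete, here is a minimal toy sketch of the pattern for case #1: take a shallow copy before the session, hold an exclusive lock while it runs, and restore the copy if anything fails. Everything in it ({{ToyBackupSystemTable}}, {{run_session}}, the row keys and values) is hypothetical and in-memory only; it is not the HBase API and not the code in the attached patches.

```python
import threading


class ToyBackupSystemTable:
    """In-memory stand-in for the backup "system" table (not an HBase API)."""

    def __init__(self):
        self._rows = {}
        # Models the mutual exclusion between backup sessions.
        self._lock = threading.Lock()

    def rows(self):
        """Return a copy of the current table contents."""
        return dict(self._rows)

    def run_session(self, operation):
        """Run operation(rows) under the lock; restore the snapshot on failure."""
        with self._lock:
            snapshot = dict(self._rows)  # shallow copy taken before any mutation
            try:
                operation(self._rows)
                return True
            except Exception:
                self._rows = snapshot  # roll back to the pre-session state
                return False


def good_backup(rows):
    rows["backup_1"] = "COMPLETE"


def failing_backup(rows):
    rows["backup_2"] = "RUNNING"  # partial state written before the failure
    raise RuntimeError("simulated server-side failure")


table = ToyBackupSystemTable()
ok_first = table.run_session(good_backup)
ok_second = table.run_session(failing_backup)
final_rows = table.rows()
```

In this model the failed session leaves no trace in the table, and the cost of the rollback is a single reference swap, which loosely mirrors the argument that restoring a one-region table is cheap.
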
Offline+snapshot+online should be one of those rock-solid features of the system.

For #2, we're in the situation that you outline. Per the concerns you raised about "coordination" (the special handshake, to use another metaphor), this seems mitigable via the return code of {{hbase backup}} and a prominent error message in this case. I don't know if either presently exists (could you comment, [~vrodionov]?).

Both of these are predicated on the mutual exclusion of multiple clients at a higher level. Obviously, a finer-grained exclusion strategy is desirable for multiple reasons but, given my current understanding, I don't see any fundamental problem with this approach.

> Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-17852
>                 URL: https://issues.apache.org/jira/browse/HBASE-17852
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Vladimir Rodionov
>            Assignee: Vladimir Rodionov
>             Fix For: 2.0.0-beta-1
>
>         Attachments: HBASE-17852-v1.patch, HBASE-17852-v2.patch, HBASE-17852-v3.patch, HBASE-17852-v4.patch, HBASE-17852-v5.patch, HBASE-17852-v6.patch

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)