[ https://issues.apache.org/jira/browse/HBASE-17852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269756#comment-16269756 ]
Vladimir Rodionov commented on HBASE-17852:
-------------------------------------------

{quote}
Why do this? Why not just mark the backup as corrupt and move on? (Why does an incomplete back-up freeze all backups – which you say above .... I'm trying to understand).
{quote}
I have explained this many times already ... Restoring the meta table after a backup failure is a necessary step to make future backups possible. During backup create we write some data that is safe only if the backup succeeds, such as the last WAL roll timestamp per table, per RS. If the backup fails, this data becomes corrupt unless the meta table is restored from the snapshot.

{quote}
What if it's a cron job? Does this inability at moving on past failure make it so backup cannot be cron'd?
{quote}
Running backup repair automatically after a backup failure won't hurt and can be incorporated into the cron job.

{quote}
If we weren't snapshotting/restoring the backup table, we wouldn't have to make a separate table to hold bulkloaded files? Is that so? (I'm not asking for a rewrite...).
{quote}
Yes, correct.

{quote}
I am asking questions to try and understand what is going on in here. When the response is terse or lean on info, I'm going to ask another question... and so on. As to whether snapshot/restore of the meta backup table is 'bad' or not, I'm still trying to understand why we would go to the extreme of offlining a whole table – even though rare when in error and then it seems, this offlining is making it so we have to add yet another table just to hold bulk loaded files... Pardon my being slow.
{quote}
Yes, the second table was added long after the initial implementation was complete, as a result of hardening the bulk load support feature. You may consider this a workaround, but it is a pretty lightweight one. Without snapshots, we would have to make all changes to the meta table fully transactional. I think that is much harder.
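To make the argument above concrete, here is a minimal in-memory sketch of rollback-via-snapshot, written in Python purely for illustration. It is not HBase code and uses no HBase API: `BackupMetaTable`, `run_backup`, the `"last_backup"` row key, and the list standing in for the bulk load table are all hypothetical names chosen for this example. The sketch assumes the only things that matter are (a) meta-table writes that are valid only on success and (b) a separate bulk load listing that is deliberately left out of the rollback.

```python
import copy


class BackupMetaTable:
    """Hypothetical in-memory stand-in for the backup system (meta) table."""

    def __init__(self):
        self.rows = {}

    def snapshot(self):
        # Deep copy so that later writes cannot leak into the snapshot.
        return copy.deepcopy(self.rows)

    def restore(self, snap):
        # Replace the current contents with the snapshot contents.
        self.rows = copy.deepcopy(snap)


def run_backup(meta, bulk_load_table, wal_roll_ts, observer_files, fail=False):
    """Simulate one backup; roll the meta table back if the backup fails.

    meta            -- BackupMetaTable written by the backup client
    bulk_load_table -- separate list written only by (simulated) observers
    wal_roll_ts     -- {(table, region_server): timestamp} written mid-backup
    observer_files  -- bulk loaded file names recorded while the backup runs
    Returns True on success, False if the backup failed and was rolled back.
    """
    snap = meta.snapshot()  # taken before the operation starts
    try:
        # Writes that are only valid if the whole backup succeeds.
        for key, ts in wal_roll_ts.items():
            meta.rows[key] = ts
        # Observers record bulk loads in their own table, not in meta.
        bulk_load_table.extend(observer_files)
        if fail:
            raise RuntimeError("simulated backup failure")
        meta.rows["last_backup"] = "COMPLETE"
        return True
    except RuntimeError:
        # Roll back meta only; the bulk load table is intentionally untouched.
        meta.restore(snap)
        return False
```

In this model a failed run leaves the meta table exactly as it was before the backup started, so the next backup does not see a WAL roll timestamp from an operation that never completed, while entries recorded by the observers survive because they live outside the restored table.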
> Add Fault tolerance to HBASE-14417 (Support bulk loaded files in incremental backup)
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-17852
>                 URL: https://issues.apache.org/jira/browse/HBASE-17852
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Vladimir Rodionov
>            Assignee: Vladimir Rodionov
>             Fix For: 2.0.0-beta-1
>
>         Attachments: HBASE-17852-v1.patch, HBASE-17852-v2.patch,
> HBASE-17852-v3.patch, HBASE-17852-v4.patch, HBASE-17852-v5.patch,
> HBASE-17852-v6.patch, HBASE-17852-v7.patch, HBASE-17852-v8.patch
>
>
> Design approach (rollback-via-snapshot) implemented in this ticket:
> # Before backup create/delete/merge starts, we take a snapshot of the backup
> meta-table (backup system table). This procedure is lightweight because the
> meta table is small and usually fits in a single region.
> # When an operation fails on the server side, we handle the failure by
> cleaning up partial data in the backup destination and then restoring the
> backup meta-table from the snapshot.
> # When an operation fails on the client side (abnormal termination, for
> example), the next time the user tries create/merge/delete they will see an
> error message saying that the system is in an inconsistent state and repair
> is required; they will need to run the backup repair tool.
> # To avoid multiple writers to the backup system table (the backup client and
> the BackupObservers), we introduce a small table used ONLY to keep the
> listing of bulk loaded files. All backup observers will work only with this
> new table. The reason: if a failure occurs during backup
> create/delete/merge/restore and the system performs an automatic rollback,
> some data written by backup observers during the failed operation may be
> lost. This is what we try to avoid.
> # The second table keeps only bulk load related references. We do not care
> about the consistency of this table, because bulk load is an idempotent
> operation and can be repeated after a failure. Partially written data in the
> second table does not affect the BackupHFileCleaner plugin, because this data
> (the list of bulk loaded files) corresponds to files which have not yet been
> loaded successfully and hence are not visible to the system.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)