[ 
https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877077#action_12877077
 ] 

Li Chongxin commented on HBASE-50:
----------------------------------

bq. ...make sure you two fellas are not fighting each other regards log file 
archiving.

In the current design for snapshots, only hfiles are archived; log files are 
not archived but copied directly. However, given that old log files are 
currently archived instead of deleted, I think I can back up log files by 
reference, just as with hfiles. When we roll the WAL files at the start of the 
snapshot, a list of references to these log files can be created. When we need 
to split these old logs, we can follow the references to find the log files. 
If a log file has been removed from its original dir, we look for it in the 
archive dir, just as with hfiles. A LogCleanerDelegate for snapshots would be 
added to the chain to check whether a log file may be deleted from the archive 
dir. What do you think?
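
The cleaner described above might look like the sketch below. This is a 
minimal illustration only: the interface shape and the in-memory reference 
set are my assumptions, not the actual HBase LogCleanerDelegate API, and the 
set would really be loaded from the snapshot reference files.

```java
import java.util.HashSet;
import java.util.Set;

// Assumed shape of a cleaner plugin in the delegate chain.
interface LogCleanerDelegate {
    // Return true if the archived log file may be deleted.
    boolean isLogDeletable(String logPath);
}

class SnapshotLogCleaner implements LogCleanerDelegate {
    // Archived WALs still referenced by some snapshot (assumed to be
    // rebuilt from the snapshots' reference lists on startup).
    private final Set<String> referencedLogs = new HashSet<>();

    void addSnapshotReference(String logPath) {
        referencedLogs.add(logPath);
    }

    void removeSnapshotReference(String logPath) {
        referencedLogs.remove(logPath);
    }

    @Override
    public boolean isLogDeletable(String logPath) {
        // Keep the archived log as long as any snapshot references it.
        return !referencedLogs.contains(logPath);
    }
}
```

The chain would consult this delegate alongside the existing TTL-based 
cleaner, so a log survives in the archive until every snapshot referencing 
it is gone.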

bq. here is where your use of a Reference might actually come in handy. If 
snapshot directory had all References under it, perhaps, we could start against 
the snapshot directory but immediately after startup, as we do for 
half-references..

Yes, this will probably make restore quick. Should we create the snapshot 
with Reference files, or with a single file that lists the names of the 
hfiles?
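
Either way, following a snapshot reference would mean "prefer the original 
location, fall back to the archive", the same lookup we do for hfiles. A 
sketch, where the directory layout and names are hypothetical and stand in 
for real FileSystem checks:

```java
import java.util.Set;

class SnapshotRefResolver {
    private final Set<String> liveFiles;     // still under the table dir
    private final Set<String> archivedFiles; // moved to the archive dir

    SnapshotRefResolver(Set<String> liveFiles, Set<String> archivedFiles) {
        this.liveFiles = liveFiles;
        this.archivedFiles = archivedFiles;
    }

    // Follow a snapshot Reference: check the original dir first; if the
    // file was archived in the meantime, look there instead.
    String resolve(String fileName) {
        if (liveFiles.contains(fileName)) {
            return "/hbase/table/region/" + fileName;   // assumed layout
        }
        if (archivedFiles.contains(fileName)) {
            return "/hbase/.archive/" + fileName;       // assumed layout
        }
        throw new IllegalStateException(
            "dangling snapshot reference: " + fileName);
    }
}
```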

bq. One problem though is that regions get deleted when there are no longer 
references to the a split parent. Won't this mean you lose snapshot data? Would 
this require you to keep snapshots in a table of its own?

This scenario is covered in section 6.2, 'Split'. A split parent region will 
not be deleted until there are neither daughter references nor snapshot 
references to it (snapshot references are the references in the 'snapshot' 
family). The MetaScanner on the master side will check both kinds of 
references in .META. So this is not the reason a 'snapshot' catalog table is 
required.

The .META. table has a region-centric view. Each row of .META. holds 
information about a single region, e.g. its regioninfo, the server hosting 
the region, split daughters, and the hfiles referenced by snapshots. But some 
other information is snapshot-centric or table-centric, e.g. the table for 
which a snapshot was created and the snapshot creation time. The snapshot 
catalog table was introduced to hold this information. But now I think a meta 
info file created per snapshot could do the same work. A snapshot will not be 
modified once it is created, right?
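
Such a per-snapshot meta info file could be as simple as the sketch below. 
The field names and the key=value format are assumptions for illustration; 
the class is immutable precisely because a snapshot is never modified after 
creation.

```java
final class SnapshotMeta {
    final String snapshotName;
    final String tableName;
    final long creationTime;

    SnapshotMeta(String snapshotName, String tableName, long creationTime) {
        this.snapshotName = snapshotName;
        this.tableName = tableName;
        this.creationTime = creationTime;
    }

    // Contents of the snapshot's (hypothetical) meta info file.
    String serialize() {
        return "name=" + snapshotName + "\n"
             + "table=" + tableName + "\n"
             + "created=" + creationTime + "\n";
    }

    static SnapshotMeta parse(String text) {
        String name = null, table = null;
        long created = 0L;
        for (String line : text.split("\n")) {
            String[] kv = line.split("=", 2);
            if (kv.length != 2) continue;
            switch (kv[0]) {
                case "name":    name = kv[1]; break;
                case "table":   table = kv[1]; break;
                case "created": created = Long.parseLong(kv[1]); break;
            }
        }
        return new SnapshotMeta(name, table, created);
    }
}
```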

bq. ...If we switch to an asynchronous approach. Should the RS start snapshot 
immediately when it is ready? I do not follow. Please retry.

In the current design, the snapshot is started and performed synchronously. 
That is, even when a RS is ready, it does not start the snapshot until all 
RSs are ready; once all RSs are ready, they perform the snapshot 
concurrently. This guarantees the snapshot is not started if any RS fails to 
prepare. But doing the work concurrently might create too much load and 
impact normal operation, right?

An alternative approach is to do the snapshot asynchronously. The snapshot 
request is spread over the cluster via ZK and the client returns immediately. 
Each RS starts performing the snapshot as soon as it receives the request, 
without knowing the status of the other RSs, just as compactions and splits 
do.

Which do you prefer?
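
The asynchronous alternative can be sketched as below. A plain listener 
stands in for the ZooKeeper watch (this is an assumption to keep the sketch 
self-contained); the point is that there is no barrier, so each RS acts the 
moment it is notified.

```java
import java.util.ArrayList;
import java.util.List;

class SnapshotCoordinator {
    // Stand-in for a ZK watch on the snapshot request znode.
    interface RegionServer {
        void onSnapshotRequest(String snapshotName);
    }

    private final List<RegionServer> servers = new ArrayList<>();

    void register(RegionServer rs) {
        servers.add(rs);
    }

    // Publish the request and return immediately: no waiting for all
    // RSs to report ready, each starts its snapshot independently.
    void requestSnapshot(String name) {
        for (RegionServer rs : servers) {
            rs.onSnapshotRequest(name);
        }
    }
}
```

The trade-off is the one above: without the ready-barrier of the 
synchronous design, a failed RS is discovered only after the others have 
already begun.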

> Snapshot of table
> -----------------
>
>                 Key: HBASE-50
>                 URL: https://issues.apache.org/jira/browse/HBASE-50
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Billy Pearson
>            Assignee: Li Chongxin
>            Priority: Minor
>         Attachments: HBase Snapshot Design Report V2.pdf, snapshot-src.zip
>
>
> Having an option to take a snapshot of a table would be very useful in 
> production.
> What I would like to see this option do is merge all the data into one or 
> more files stored in the same folder on the DFS. This way we could save 
> data in case of a software bug in Hadoop or user code. 
> The other advantage would be the ability to export a table to multiple 
> locations. Say I had a read-only table that must stay online. I could take 
> a snapshot of it when needed, export it to a separate data center, and 
> have it loaded there; then I would have it online at multiple data centers 
> for load balancing and failover.
> I understand that Hadoop removes the need for backups to protect against 
> failed servers, but this does not protect us from software bugs that might 
> delete or alter data in ways we did not plan. We should have a way to roll 
> back a dataset.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
