[ 
https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852871#action_12852871
 ] 

stack commented on HBASE-50:
----------------------------

Here are some questions received in private mail on the above:

.bq When the master gets a listing of the table's files, the full file list as 
well as the WAL will be used as input for a distcp job to generate the snapshot, 
right?

The list of files does not go to the master, at least in my thinking.  We'd 
just dump them into the filesystem into a well-known place under a subdir named 
for the name of the snapshot.  But yeah, the list can then be used as input to 
the distcp job.
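
To make the "well-known place" idea concrete, here is a minimal sketch. It uses a local directory as a stand-in for the filesystem, and the directory name ".snapshot", the file name "files.list" and the function write_file_list are all made up for illustration; none of them are existing HBase names.

```python
from pathlib import Path

# Hypothetical name for the "well-known place"; not an actual HBase path.
SNAPSHOT_DIR = ".snapshot"


def write_file_list(root, snapshot_name, files):
    """Write one file path per line under <root>/.snapshot/<snapshot_name>/."""
    snapshot_dir = Path(root) / SNAPSHOT_DIR / snapshot_name
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    listing = snapshot_dir / "files.list"
    # Sorted so that the listing is stable from run to run.
    listing.write_text("".join(f"{path}\n" for path in sorted(files)))
    return listing
```

distcp's -f option takes a file that lists source URIs, so a listing along these lines could serve as its input, assuming the entries are written as full URIs.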

.bq Should compaction and split be suspended during the running of distcp?

Sort of.  I think splits and compactions can be ongoing during a snapshot, 
but transitions -- e.g. swapping the compacted file in for all the files 
compacted -- need to be temporarily suspended, as does anything else that 
will cause movement of files in the filesystem.
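
One way to picture this is a gate around the file-moving step only. The sketch below is an assumption about how it might look, not a description of existing code: TransitionGate, snapshot and try_transition are invented names, and a single lock means transitions also wait on each other, which a real design may not want.

```python
import threading
from contextlib import contextmanager


class TransitionGate:
    """Lets compaction/split work run, but gates the final file swap."""

    def __init__(self):
        self._lock = threading.Lock()

    @contextmanager
    def snapshot(self):
        # Held for as long as the snapshot is listing files.
        with self._lock:
            yield

    def try_transition(self, swap):
        """Run swap() unless a snapshot is in progress; report whether it ran."""
        if not self._lock.acquire(blocking=False):
            return False  # caller retries the swap later
        try:
            swap()
            return True
        finally:
            self._lock.release()
```

A compaction would do all of its work outside the gate and call try_transition only to swap files, retrying if it returns False.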

.bq You mentioned that "The moved aside, non-deleted files might be copied 
elsewhere". What do these files refer to?

Right now, on completion of a compaction, we put the new file in place and, 
when that is successful, we delete the old ones.  Similarly, on completion of 
a split, we delete the old regions.  An hbase that supports snapshotting never 
deletes.  It never deletes because if we want to restore a snapshot, we need 
the old data to do the reconstruction.  Since hbase does lots of things like 
listing directories to find the current set of data files for, say, a column 
family, we either change this behavior and keep a catalog of 'live' files or, 
instead, on completion of compactions and splits, the old files that are no 
longer used should be moved aside -- moved to a shadow directory structure or 
some such (we'd need to be careful to ensure no name clashes).
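
A rough sketch of the move-aside option, again on a local directory tree rather than the real filesystem. The function move_aside and the numeric-suffix scheme for avoiding name clashes are assumptions made up for this example.

```python
import shutil
from pathlib import Path


def move_aside(live_root, shadow_root, old_file):
    """Move old_file into a shadow tree that mirrors the live layout."""
    live_root, shadow_root = Path(live_root), Path(shadow_root)
    old_file = Path(old_file)
    # Keep the same relative path so the shadow tree mirrors the live one.
    rel = old_file.relative_to(live_root)
    target = shadow_root / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    # Avoid name clashes: append .1, .2, ... if the name is already taken.
    n = 0
    while target.exists():
        n += 1
        target = (shadow_root / rel).with_name(f"{rel.name}.{n}")
    shutil.move(str(old_file), str(target))
    return target
```

With this, a directory listing of the live tree still shows only the current data files, while the old ones remain available for a restore.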

> Snapshot of table
> -----------------
>
>                 Key: HBASE-50
>                 URL: https://issues.apache.org/jira/browse/HBASE-50
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: Billy Pearson
>            Priority: Minor
>
> Having an option to take a snapshot of a table would be very useful in 
> production.
> What I would like to see this option do is merge all the data into 
> one or more files stored in the same folder on the dfs. This way we could 
> save data in case of a software bug in hadoop or user code. 
> The other advantage would be the ability to export a table to multiple 
> locations. 
> Say I had a read_only table that must be online. I could take a snapshot of 
> it when needed and export it to a separate data center and have it loaded 
> there, and then I would have it online at multiple data centers for load 
> balancing and failover.
> I understand that hadoop removes the need for backups to protect 
> against failed servers, but this does not protect us from software bugs that 
> might delete or alter data in ways we did not plan. We should have a way to 
> roll back a dataset.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
