[ https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897250#action_12897250 ]

HBase Review Board commented on HBASE-50:
-----------------------------------------

Message from: "Chongxin Li" <[email protected]>


bq.  On 2010-08-10 21:34:40, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/HTableDescriptor.java, line 673
bq.  > <http://review.cloudera.org/r/467/diff/3/?file=6002#file6002line673>
bq.  >
bq.  >     This is fine for an hbase that is a fresh install, but what about
bq.  >     the case where the data has been migrated from an older hbase
bq.  >     version; it won't have this column family in .META.  We should make
bq.  >     a little migration script that adds it, or on start of the new
bq.  >     version, check for it and if not present, create it.

That's right. But the AddColumn operation requires the table to be disabled, 
and the ROOT table cannot be disabled once the system is started. How, then, 
could we execute the migration script, or check for the family and create it 
on start of the new version?


bq.  On 2010-08-10 21:34:40, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java, line 899
bq.  > <http://review.cloudera.org/r/467/diff/3/?file=6005#file6005line899>
bq.  >
bq.  >     Can the snapshot name be empty and then we'll make one up?

A default snapshot name? Or an auto-generated one, such as the creation time?
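
Illustrative only (this fallback is not in the patch): an empty name could be 
derived from the creation timestamp, e.g.

    // Hypothetical fallback, not part of the current patch: derive a
    // snapshot name from the creation time when the user passes none.
    String snapshotName = (name == null || name.isEmpty())
        ? "snapshot." + System.currentTimeMillis()
        : name;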


bq.  On 2010-08-10 21:34:40, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java, line 951
bq.  > <http://review.cloudera.org/r/467/diff/3/?file=6005#file6005line951>
bq.  >
bq.  >     For restore of the snapshot, do you use loadtable.rb or Todd's new
bq.  >     bulk-loading scripts?

Currently, no.
A snapshot is composed of a list of log files and a bunch of reference files 
for the HFiles of the table. These reference files have the same hierarchy as 
the original table, and their names are in the format 
"1239384747630.tablename", where the front part is the file name of the 
referred-to HFile and the latter part is the table name of the snapshot. Thus, 
to restore a snapshot, we just copy the reference files (which are only a few 
bytes each) to the table dir, update .META., and split the logs. When the 
table is enabled, the system knows how to replay the commit edits and how to 
read such reference files. The methods getReferredToFile and open in StoreFile 
have been updated to deal with this kind of reference file for snapshots.
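
For illustration, a minimal sketch of how such a name splits into its two 
parts (the variable names are made up, not from the patch):

    // Hypothetical: split "1239384747630.tablename" into the referred-to
    // HFile name and the snapshot's table name.
    String refName = "1239384747630.tablename";
    int dot = refName.indexOf('.');
    String hfileName = refName.substring(0, dot);    // referred-to HFile
    String tableName = refName.substring(dot + 1);   // snapshot's table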

At present, a snapshot can only be restored to a table whose name is the same 
as that of the table the snapshot was created for. That means the old table 
with the same name must be deleted before restoring the snapshot; that's what 
I do in the unit test TestAdmin. Restoring a snapshot to a different table 
name has a low priority and has not been implemented yet.


bq.  On 2010-08-10 21:34:40, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/io/Reference.java, line 50
bq.  > <http://review.cloudera.org/r/467/diff/3/?file=6008#file6008line50>
bq.  >
bq.  >     Whats this?  A different kind of reference?

Yes. This is the reference file used in a snapshot; it references an HFile of 
the original table.


bq.  On 2010-08-10 21:34:40, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/master/SnapshotLogCleaner.java, line 115
bq.  > <http://review.cloudera.org/r/467/diff/3/?file=6018#file6018line115>
bq.  >
bq.  >     This looks like a class that you could write a unit test for?

Sure, I'll add another case in TestLogsCleaner.


bq.  On 2010-08-10 21:34:40, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/master/RestoreSnapshot.java, line 130
bq.  > <http://review.cloudera.org/r/467/diff/3/?file=6017#file6017line130>
bq.  >
bq.  >     If the table were big, this could be prohibitively expensive? A
bq.  >     single-threaded copy of all of a table's data? We could complement
bq.  >     this with an MR-based restore, something that did the copy using MR?

This method is only used in RestoreSnapshot, where the reference files of a 
snapshot are copied to the table dir. These reference files contain just a few 
bytes each, not the table's data: snapshots share the table data with the 
original table and with other snapshots. Do we still need an MR job?
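
As a rough sketch of why a plain serial copy suffices (the helper is 
hypothetical and assumes Hadoop's FileSystem/FileUtil APIs; a flat listing is 
shown for brevity, whereas the real layout mirrors the table's hierarchy):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;

    // Hypothetical helper, not from the patch: serially copy the tiny
    // snapshot reference files into the table dir.
    public class CopySnapshotRefs {
      static void copyRefs(FileSystem fs, Path snapshotDir, Path tableDir,
          Configuration conf) throws IOException {
        for (FileStatus ref : fs.listStatus(snapshotDir)) {
          // Each reference file is only a few bytes; no MR job needed.
          FileUtil.copy(fs, ref.getPath(), fs,
              new Path(tableDir, ref.getPath().getName()), false, conf);
        }
      }
    }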


bq.  On 2010-08-10 21:34:40, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/master/BaseScanner.java, line 212
bq.  > <http://review.cloudera.org/r/467/diff/3/?file=6013#file6013line212>
bq.  >
bq.  >     Why Random negative number?  Why not just leave it blank?

If a blank value were used as the key, only one entry would survive when the 
regions are scanned for the first few times, since duplicate keys collapse. 
Using a random negative number marks all these regions as never scanned 
before; once a region has been scanned, its last checking time is stored as 
the key instead.
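
A toy illustration of the point (the map and names are made up, not the actual 
BaseScanner code):

    import java.util.Random;
    import java.util.TreeMap;

    // Hypothetical: regions keyed by their "last checked" time.
    Random rand = new Random();
    TreeMap<Long, String> byLastChecked = new TreeMap<Long, String>();
    // A constant (blank) key would collapse all never-scanned regions into
    // a single entry; a random negative key keeps one entry per region,
    // sorted before any real (positive) last-check timestamp.
    byLastChecked.put(-1L - rand.nextInt(Integer.MAX_VALUE), "regionA");
    byLastChecked.put(-1L - rand.nextInt(Integer.MAX_VALUE), "regionB");
    byLastChecked.put(System.currentTimeMillis(), "regionC"); // scanned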


bq.  On 2010-08-10 21:34:40, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java, line 251
bq.  > <http://review.cloudera.org/r/467/diff/3/?file=6012#file6012line251>
bq.  >
bq.  >     Is this comment right?

I just renamed the Ranges to caps; the comment was not changed.


bq.  On 2010-08-10 21:34:40, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/io/Reference.java, line 149
bq.  > <http://review.cloudera.org/r/467/diff/3/?file=6008#file6008line149>
bq.  >
bq.  >     Hmm... is this good?  You are dropping some of the region name when
bq.  >     you toString.  Do we have to?

This has not been changed; I just renamed the field "region" to "range".


bq.  On 2010-08-10 21:34:40, stack wrote:
bq.  > src/main/java/org/apache/hadoop/hbase/io/Reference.java, line 156
bq.  > <http://review.cloudera.org/r/467/diff/3/?file=6008#file6008line156>
bq.  >
bq.  >     This could be a problem when fellas migrate old data to use this
bq.  >     new hbase.  If there are References in the old data, then this
bq.  >     deserialization will fail?  I'm fine w/ you creating a new issue
bq.  >     named something like "Migration from 0.20.x hbase to 0.90" and
bq.  >     adding a note in there that we need to consider this little issue.
bq.  >     Better though would be if the data was able to migrate itself at
bq.  >     runtime; i.e. recognize a boolean on the stream and then
bq.  >     deserialize the old style into the new, etc.

Actually I think it is fine to migrate old data to the new hbase. Old 
references were serialized by DataOutput.writeBoolean(boolean), where the 
value (byte)1 is written if the argument is true and the value (byte)0 is 
written if it is false.

See (from Ted's review):
http://download.oracle.com/javase/1.4.2/docs/api/java/io/DataOutput.html#writeBoolean%28boolean%29

Thus the value (byte)1 was written if it was the top file region 
(isTopFileRegion is true). That is exactly the same as the current value of 
TOP, so this deserialization should work for the references in the old data 
as well, right?

That's why we cannot use the ordinal of the Enum and serialize it as an int: 
the serialized size of this field would differ between new data and old data 
if an int value were used.
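
To make this concrete, here is a minimal, runnable sketch of the argument (the 
enum and method names are illustrative, not necessarily those in 
Reference.java):

    import java.io.*;

    public class RangeCompat {
      // BOTTOM=0, TOP=1 so that TOP serializes to the same single byte that
      // DataOutput.writeBoolean wrote for the old isTopFileRegion == true.
      enum Range { BOTTOM, TOP }

      static void write(DataOutput out, Range r) throws IOException {
        out.writeBoolean(r == Range.TOP);  // one byte, same as the old format
      }

      static Range read(DataInput in) throws IOException {
        // An old file's boolean reads back as the correct Range; writeInt
        // would have used 4 bytes and broken old-file compatibility.
        return in.readBoolean() ? Range.TOP : Range.BOTTOM;
      }

      public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        write(new DataOutputStream(bos), Range.TOP);
        DataInput in = new DataInputStream(
            new ByteArrayInputStream(bos.toByteArray()));
        System.out.println(read(in) + ", bytes=" + bos.size());  // TOP, 1
      }
    }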


- Chongxin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/467/#review823
-----------------------------------------------------------





> Snapshot of table
> -----------------
>
>                 Key: HBASE-50
>                 URL: https://issues.apache.org/jira/browse/HBASE-50
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Billy Pearson
>            Assignee: Li Chongxin
>            Priority: Minor
>         Attachments: HBase Snapshot Design Report V2.pdf, HBase Snapshot 
> Design Report V3.pdf, HBase Snapshot Implementation Plan.pdf, Snapshot Class 
> Diagram.png
>
>
> Having an option to take a snapshot of a table would be very useful in 
> production.
> What I would like to see this option do is merge all the data into one or 
> more files stored in the same folder on the dfs. This way we could save 
> data in case of a software bug in hadoop or user code. 
> The other advantage would be the ability to export a table to multiple 
> locations. Say I had a read-only table that must be online. I could take a 
> snapshot of it when needed, export it to a separate data center, and have 
> it loaded there; then I would have it online at multiple data centers for 
> load balancing and failover.
> I understand that hadoop takes away the need to keep backups to protect 
> from failed servers, but this does not protect us from software bugs that 
> might delete or alter data in ways we did not plan. We should have a way to 
> roll back a dataset.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
