[ 
https://issues.apache.org/jira/browse/HBASE-13639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216169#comment-15216169
 ] 

Dave Latham commented on HBASE-13639:
-------------------------------------

Sorry for the lack of better documentation, [~abhishek_soni].  Thanks for 
bringing it up.  I'll try to provide a better explanation.  You may have 
already seen it, but if not, the design doc linked in the description above may 
also give you some better clues as to how it should be used.

Briefly, the feature is intended to start with a pair of tables in remote 
clusters that are already substantially similar and make them identical by 
comparing hashes of the data and copying only the diffs instead of having to 
copy the entire table.  So it is targeted at a very specific use case (with 
some work it could generalize to cover things like CopyTable and 
VerifyReplication, but it's not there yet).  To use it, you choose one table to 
be the "source", and the other table is the "target".  After the process is 
complete the target table should end up being identical to the source table.

- In the source table's cluster, run 
org.apache.hadoop.hbase.mapreduce.HashTable and pass it the name of the source 
table and an output directory in HDFS.  HashTable will scan the source table, 
break the data up into row key ranges (default of 8kB per range) and produce a 
hash of the data for each range.
- Make the hashes available to the target cluster - I'd recommend using DistCp 
to copy them across.
- In the target table's cluster, run 
org.apache.hadoop.hbase.mapreduce.SyncTable and pass it the directory where you 
put the hashes, and the names of the source and target tables.  You will 
likely also need to specify the source table's ZK quorum via the 
--sourcezkcluster option.  SyncTable will then read the hash information, and 
compute the hashes of the same row ranges for the target table.  For any row 
range where the hash fails to match, it will open a remote scanner to the 
source table, read the data for that range, and do Puts and Deletes to the 
target table to update it to match the source.
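
To make the flow above concrete, here is a minimal in-memory sketch of the 
hash-and-compare idea in Python.  It is not the HBase implementation: tables 
are plain dicts of bytes, and hash_table, sync_table, rows_in_range and the 
use of MD5 are all illustrative assumptions rather than anything taken from 
the real HashTable/SyncTable jobs.  Only the overall shape (hash row key 
ranges of the source, compare against the target, fix up only the ranges that 
differ) follows the description.

```python
import hashlib


def _digest(rows):
    """Hash a list of (key, value) pairs; lengths are prefixed to avoid ambiguity."""
    md5 = hashlib.md5()
    for key, value in rows:
        for part in (key, value):
            md5.update(len(part).to_bytes(4, "big"))
            md5.update(part)
    return md5.hexdigest()


def rows_in_range(table, start, stop):
    """Return sorted (key, value) pairs with start <= key < stop (stop=None means no upper bound)."""
    return [(k, table[k]) for k in sorted(table)
            if k >= start and (stop is None or k < stop)]


def hash_table(source, batch_bytes=8000):
    """Step 1 (source side): split rows into key ranges of roughly batch_bytes and hash each.

    Returns a list of (range_start_key, digest). Each range ends where the next
    one starts; the first starts at b"" and the last is unbounded.
    """
    hashes = []
    start, batch, size = b"", [], 0
    for key in sorted(source):
        if size >= batch_bytes:
            hashes.append((start, _digest(batch)))
            start, batch, size = key, [], 0
        batch.append((key, source[key]))
        size += len(key) + len(source[key])
    hashes.append((start, _digest(batch)))
    return hashes


def sync_table(hashes, source, target):
    """Step 3 (target side): re-hash the same ranges and repair only the mismatches.

    Reading `source` here stands in for the remote scan of the source table.
    Returns the number of ranges that had to be repaired.
    """
    repaired = 0
    for i, (start, expected) in enumerate(hashes):
        stop = hashes[i + 1][0] if i + 1 < len(hashes) else None
        target_rows = rows_in_range(target, start, stop)
        if _digest(target_rows) == expected:
            continue  # range already identical, nothing is read from the source
        source_rows = dict(rows_in_range(source, start, stop))
        for key, _ in target_rows:
            if key not in source_rows:
                del target[key]          # "Delete": row exists only in the target
        for key, value in source_rows.items():
            if target.get(key) != value:
                target[key] = value      # "Put": row is missing or stale in the target
        repaired += 1
    return repaired


# Small demonstration with made-up data.
source = {b"row%03d" % i: b"v%d" % i for i in range(100)}
target = dict(source)
target[b"row010"] = b"stale"   # differing value
del target[b"row050"]          # missing row
target[b"zzz"] = b"extra"      # row that should not exist

hashes = hash_table(source, batch_bytes=100)
repaired = sync_table(hashes, source, target)
print(repaired, "of", len(hashes), "ranges repaired; identical:", target == source)
```

The point of the sketch is that only the ranges whose hashes differ are ever 
read from the source, which is where the network savings come from when the 
tables are already mostly the same.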

I hope that clarifies it a bit.  Let me know if you need a hand.  If anyone 
wants to work on getting some documentation into the book, I can try to write 
some more but would love a hand on turning it into an actual book patch.




> SyncTable - rsync for HBase tables
> ----------------------------------
>
>                 Key: HBASE-13639
>                 URL: https://issues.apache.org/jira/browse/HBASE-13639
>             Project: HBase
>          Issue Type: New Feature
>          Components: mapreduce, Operability, tooling
>            Reporter: Dave Latham
>            Assignee: Dave Latham
>              Labels: tooling
>             Fix For: 2.0.0, 0.98.14, 1.2.0
>
>         Attachments: HBASE-13639-0.98-addendum-hadoop-1.patch, 
> HBASE-13639-0.98.patch, HBASE-13639-v1.patch, HBASE-13639-v2.patch, 
> HBASE-13639-v3-0.98.patch, HBASE-13639-v3.patch, HBASE-13639.patch
>
>
> Given HBase tables in remote clusters with similar but not identical data, 
> efficiently update a target table such that the data in question is identical 
> to a source table.  Efficiency in this context means using far less network 
> traffic than would be required to ship all the data from one cluster to the 
> other.  Takes inspiration from rsync.
> Design doc: 
> https://docs.google.com/document/d/1-2c9kJEWNrXf5V4q_wBcoIXfdchN7Pxvxv1IO6PW0-U/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
