[ https://issues.apache.org/jira/browse/HBASE-11715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098940#comment-14098940 ]

Jean-Marc Spaggiari commented on HBASE-11715:
---------------------------------------------

{quote}
1. How is this table copied. Do we flush and just move the HFiles over.
{quote}
Copying the table is not in scope for this issue. This is just a tool to 
compare the tables' content.

{quote}
2. What do we do if they are not equivalent. Is it enough to throw an error, or 
do we need to say what part of the table isn't equivalent.
{quote}
We report the information back to the user, e.g. for the range A to C, the 
content differs between the 2 tables.

{quote}
3. Do Merkle trees make sense for this type of thing?
{quote}
Not sure. We don't have any tree structure here.

{quote}
I am interested in working on this task. For a Merkle tree, we would need to 
constantly run some background service, and it would require an additional 
amount of data.
{quote}
I don't think a Merkle tree is the right option here, but you can still 
evaluate it.

{quote}
Can you provide more details, I can assign it to myself and work on this? 
{quote}
Sure! Let's go for it.

> HBase should provide a tool to compare 2 remote tables.
> -------------------------------------------------------
>
>                 Key: HBASE-11715
>                 URL: https://issues.apache.org/jira/browse/HBASE-11715
>             Project: HBase
>          Issue Type: New Feature
>          Components: util
>            Reporter: Jean-Marc Spaggiari
>
> As discussed on the mailing list, when a table is copied to another cluster 
> and needs to be validated against the first one, only VerifyReplication can 
> be used. However, this can take very long since the data needs to be copied 
> again.
> We should provide an easier and faster way to compare the tables. 
> One option is to calculate hashes per range. The user can define a number of 
> buckets; we then split the table into this number of buckets and calculate a 
> hash for each (like the partitioner is already doing). We can also optionally 
> calculate an overall CRC to reduce hash collisions even further. 
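To illustrate the per-range hashing idea described above, here is a minimal sketch (not the actual HBase tool, just an illustration): rows sorted by key are assigned to buckets defined by split points, one digest is accumulated per bucket, and only buckets whose digests differ need further inspection. The function names, the use of MD5, and the in-memory row lists are all assumptions for the example.

```python
import hashlib
from bisect import bisect_right

def range_hashes(rows, splits):
    # rows: iterable of (key, value) byte-string pairs, sorted by key.
    # splits: sorted list of boundary keys defining len(splits) + 1 buckets.
    hashers = [hashlib.md5() for _ in range(len(splits) + 1)]
    for key, value in rows:
        b = bisect_right(splits, key)  # bucket index for this key
        hashers[b].update(key)
        hashers[b].update(value)
    return [h.hexdigest() for h in hashers]

def differing_buckets(rows_a, rows_b, splits):
    # Compare per-bucket digests of two tables; return the indices of
    # the buckets (key ranges) whose content differs.
    ha = range_hashes(rows_a, splits)
    hb = range_hashes(rows_b, splits)
    return [i for i, (x, y) in enumerate(zip(ha, hb)) if x != y]

# Two small "tables" that differ only in the middle key range:
rows_a = [(b"a", b"1"), (b"d", b"2"), (b"p", b"3")]
rows_b = [(b"a", b"1"), (b"d", b"9"), (b"p", b"3")]
splits = [b"c", b"m"]  # 3 buckets: [..c], (c..m], (m..]
print(differing_buckets(rows_a, rows_b, splits))  # → [1]
```

Only the differing range then needs a row-by-row comparison, which is what makes this cheaper than re-copying or re-scanning both tables in full.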



--
This message was sent by Atlassian JIRA
(v6.2#6252)
