Rahul Kumar created PHOENIX-7751:
------------------------------------

             Summary: Feature to validate table data using PhoenixSyncTable 
tool b/w source and target cluster
                 Key: PHOENIX-7751
                 URL: https://issues.apache.org/jira/browse/PHOENIX-7751
             Project: Phoenix
          Issue Type: Sub-task
    Affects Versions: 5.3.0, 5.2.1, 5.2.0
            Reporter: Rahul Kumar
            Assignee: Rahul Kumar


Start by implementing a tool that, as a first step, only validates data between 
the source and target clusters for a given table.

Generate mappers based on region boundaries on the source cluster. Each mapper 
further splits its region range into chunks of a pre-defined chunkSize (bytes 
or rows), then fetches the target chunk for each source chunk's start/end key 
and compares hashes.
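
As a rough illustration of the chunking step, here is a minimal Python sketch. 
It is not Phoenix code: the name chunk_region, the (key, value) row shape and 
the byte-based size accounting are assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[bytes, bytes]  # (row key, serialized row value); shape is an assumption


def chunk_region(
    rows: Iterable[Row], chunk_size_bytes: int
) -> Iterator[Tuple[bytes, bytes, List[Row]]]:
    """Yield (first_key, last_key, rows) for each chunk of a region scan.

    A chunk is closed once its accumulated size reaches chunk_size_bytes,
    or when the end of the region is reached.
    """
    chunk: List[Row] = []
    size = 0
    for key, value in rows:
        chunk.append((key, value))
        size += len(key) + len(value)
        if size >= chunk_size_bytes:
            yield chunk[0][0], chunk[-1][0], chunk
            chunk, size = [], 0
    if chunk:  # end of region reached before the chunk filled up
        yield chunk[0][0], chunk[-1][0], chunk


if __name__ == "__main__":
    region = [(b"30", b"aaaa"), (b"40", b"bbbb"), (b"50", b"cccc")]
    for first, last, rows in chunk_region(region, chunk_size_bytes=10):
        print(first, last, len(rows))
```

The sketch reports the first and last key of each chunk; turning the last key 
into an exclusive end key is left out.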

There will be cases where a source chunk lies within one region, say [30,80), 
but the same range on the target is split across multiple regions, e.g. 
[30,40), [40,50), [50,80). In such cases we cannot rely on comparing hashes 
across source and target directly, because the hash would look like 
MD5(MD5(row1...row30) + MD5(row31...row60) + MD5(row61...rowN)) on the target vs 
MD5(row1 + row2 + ... + rowN) on the source. The two differ even for identical 
rows, because checksums are not associative.
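
The mismatch is easy to reproduce with Python's hashlib. This is only a small 
demonstration; the row contents and the split point are made up.

```python
import hashlib


def md5(data: bytes) -> bytes:
    """Return the raw 16-byte MD5 digest of data."""
    return hashlib.md5(data).digest()


rows = [b"row1", b"row2", b"row3", b"row4"]

# Source side: one hash over all rows of the chunk.
flat = md5(b"".join(rows))

# Target side: one hash per region, then a hash over those hashes.
nested = md5(md5(b"".join(rows[:2])) + md5(b"".join(rows[2:])))

# Same rows, different grouping, different digest.
print(flat.hex())
print(nested.hex())
print(flat == nested)
```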

To handle such cases:
- The mapper iterates over the source region until the chunk size is reached 
(or the end of the source region), holds the rows in the server-side scanner, 
and obtains the source chunk boundary, e.g. [30,80).

- The mapper then scans the target for the source chunk boundary [30,80) and 
identifies that it has to issue multiple scans, one per target region in that 
range, e.g. [30,40), [40,50), [50,80). The mapper computes the nested hash 
MD5(MD5(row1...row30) + MD5(row31...row60) + MD5(row61...rowN)) on the target, 
instead of MD5(row1 + row2 + ... + rowN).

- The source looks at the target region boundaries, which may be multiple as 
above, i.e. [30,40), [40,50), [50,80). Using the rows already held by the 
source region scanner coproc, it computes the same nested hash as the target, 
based on the target region boundaries for the chunk.

- Now we can compare the hashes between source and target without fetching the 
data to the client.
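
The steps above can be sketched in Python as follows. This is an illustration 
under assumptions, not the PhoenixSyncTable implementation: nested_hash is a 
hypothetical helper, rows are plain (key, value) byte pairs, and keys are 
compared lexicographically as byte strings. Both sides group rows by the same 
target region boundaries, so only the digests need to be compared.

```python
import hashlib
from typing import Sequence, Tuple

Row = Tuple[bytes, bytes]       # (row key, serialized row value); an assumption
Range = Tuple[bytes, bytes]     # [start_key, end_key), end exclusive


def nested_hash(rows: Sequence[Row], boundaries: Sequence[Range]) -> str:
    """MD5 over the per-range MD5 digests, with ranges taken in the given order."""
    outer = hashlib.md5()
    for start, end in boundaries:
        inner = hashlib.md5()
        for key, value in rows:
            if start <= key < end:  # row belongs to this target region
                inner.update(key)
                inner.update(value)
        outer.update(inner.digest())
    return outer.hexdigest()


if __name__ == "__main__":
    # Target region boundaries covering the source chunk [30,80).
    target_regions = [(b"30", b"40"), (b"40", b"50"), (b"50", b"80")]

    source_rows = [(b"30", b"a"), (b"35", b"b"), (b"45", b"c"), (b"60", b"d")]
    target_rows = list(source_rows)

    # Identical data grouped the same way gives identical digests.
    print(nested_hash(source_rows, target_regions)
          == nested_hash(target_rows, target_regions))

    # A diverging row on the target changes the digest.
    target_rows[2] = (b"45", b"X")
    print(nested_hash(source_rows, target_regions)
          == nested_hash(target_rows, target_regions))
```

In the real tool the per-range hashing would run next to the data (the held 
rows in the source scanner, one scan per target region), so only boundaries and 
digests travel to the mapper.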



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
