[ https://issues.apache.org/jira/browse/PHOENIX-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rahul Kumar updated PHOENIX-7751:
---------------------------------
    Description: 
Start by implementing a tool that, as a first step, only validates data between the source and target clusters for a given table.

Generate one mapper per region boundary on the source cluster. Each mapper's region range is further split into chunks of a pre-defined chunkSize (bytes/rowCount); for every source chunk we then read the corresponding target chunk using the source chunk's start/end key and compare hashes (see the sketch below).
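A minimal sketch of that per-mapper loop. RangeHasher, KeyRange and compareRegion are hypothetical stand-ins for illustration only, not actual Phoenix or HBase APIs:

{code:java}
import java.util.Arrays;
import java.util.List;

/**
 * Sketch of the per-mapper comparison loop. All names here are
 * hypothetical stand-ins, not actual Phoenix/HBase APIs.
 */
public class ChunkCompareSketch {

    /** Hypothetical: computes a hash over the row range [start, end) on one cluster. */
    interface RangeHasher {
        byte[] hashRange(String table, byte[] start, byte[] end);
    }

    /** A chunk boundary carved out of a source region. */
    record KeyRange(byte[] start, byte[] end) {}

    static void compareRegion(String table, List<KeyRange> chunks,
                              RangeHasher source, RangeHasher target) {
        for (KeyRange chunk : chunks) {
            // The same [start, end) range is hashed on both clusters and compared.
            byte[] sourceHash = source.hashRange(table, chunk.start(), chunk.end());
            byte[] targetHash = target.hashRange(table, chunk.start(), chunk.end());
            if (!Arrays.equals(sourceHash, targetHash)) {
                System.out.printf("Mismatch in %s at chunk start %s%n",
                        table, Arrays.toString(chunk.start()));
            }
        }
    }
}
{code}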

There will be cases where a determined source chunk boundary, e.g. [30,80), lies within one region on the source but not on the target, where it may be split across multiple regions, e.g. [30,40), [40,50), [50,80). In such cases we cannot rely on naively comparing hashes across source and target: the target hash would look like MD5(MD5(row30...row39) + MD5(row40...row49) + MD5(row50...row79)) while the source hash would be MD5(row30 + row31 + ... + row79), and MD5 checksums are not associative, i.e. they do not compose.
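A self-contained Java demonstration of the non-associativity (illustrative only; the row contents are stand-in strings):

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class Md5CompositionDemo {
    public static void main(String[] args) throws Exception {
        // Stand-ins for the bytes of two sub-ranges of a chunk.
        byte[] a = "row30...row39".getBytes(StandardCharsets.UTF_8);
        byte[] b = "row40...row79".getBytes(StandardCharsets.UTF_8);

        // Source-style hash: a single MD5 over the whole chunk.
        MessageDigest md = MessageDigest.getInstance("MD5");
        md.update(a);
        md.update(b);
        byte[] whole = md.digest();

        // Target-style hash: MD5 over the concatenation of per-sub-range MD5s.
        MessageDigest outer = MessageDigest.getInstance("MD5");
        outer.update(MessageDigest.getInstance("MD5").digest(a));
        outer.update(MessageDigest.getInstance("MD5").digest(b));
        byte[] composed = outer.digest();

        // Prints false: MD5(a + b) != MD5(MD5(a) + MD5(b)).
        System.out.println(Arrays.equals(whole, composed));
    }
}
{code}

This is why the source side has to mirror the target's split points rather than hash its chunk in one pass.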

To handle such cases:
 - The mapper iterates over the source chunk until the chunk size is reached (or the end of the source region), holds the scanned rows on the server-side scanner, and records the source chunk boundary, e.g. [30,80).

 - The mapper then scans the target for the source chunk boundary [30,80) and discovers that it must issue a separate scan for each target region covering the chunk, e.g. [30,40), [40,50), [50,80). The mapper computes the associative hash MD5(MD5(row30...row39) + MD5(row40...row49) + MD5(row50...row79)) on the target.

 - The source is then handed the target region boundaries for the chunk, which may be multiple as above, i.e. [30,40), [40,50), [50,80); using the rows already held by the source region scanner coprocessor, it computes the same associative hash, split at the same target region boundaries (a sketch follows this list).

 - Now we can compare the associative checksums between source and target without ever shipping the data to the client.
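A sketch of that last step, assuming rows are modeled as sorted string key/value pairs (in the real coprocessor they would be HBase Cells held by the region scanner) and that every target sub-range contains at least one row:

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

/**
 * Sketch: the source side re-computes the target's composed hash from rows
 * it already holds, splitting at the target's region boundaries.
 */
public class ComposedChunkHash {

    /**
     * @param rows      rows of the source chunk, sorted by row key
     * @param splitKeys target region start keys inside the chunk, e.g. ["40", "50"]
     *                  for a [30,80) chunk covered by [30,40), [40,50), [50,80)
     */
    static byte[] composedHash(SortedMap<String, String> rows, List<String> splitKeys)
            throws NoSuchAlgorithmException {
        MessageDigest outer = MessageDigest.getInstance("MD5");
        MessageDigest inner = MessageDigest.getInstance("MD5");
        int next = 0; // index of the next target boundary to cross
        for (Map.Entry<String, String> row : rows.entrySet()) {
            // Crossing a target region boundary: fold the finished sub-range
            // digest into the outer digest and start a fresh inner digest.
            if (next < splitKeys.size() && row.getKey().compareTo(splitKeys.get(next)) >= 0) {
                outer.update(inner.digest()); // digest() also resets `inner`
                next++;
            }
            inner.update(row.getKey().getBytes(StandardCharsets.UTF_8));
            inner.update(row.getValue().getBytes(StandardCharsets.UTF_8));
        }
        outer.update(inner.digest()); // close the last sub-range
        return outer.digest();
    }
}
{code}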



> Feature to validate table data using PhoenixSyncTable tool b/w source and 
> target cluster
> ----------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-7751
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7751
>             Project: Phoenix
>          Issue Type: Sub-task
>    Affects Versions: 5.2.0, 5.2.1, 5.3.0
>            Reporter: Rahul Kumar
>            Assignee: Rahul Kumar
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
