[
https://issues.apache.org/jira/browse/PHOENIX-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rahul Kumar updated PHOENIX-7751:
---------------------------------
Description:
Start by implementing a tool that, as a first step, just validates data between the
source and target clusters for a given table.
Generate one mapper per region boundary on the source cluster. We further split each
mapper's region boundary into chunks of a pre-defined chunkSize (bytes/rowCount), then
read the corresponding target chunk based on the source chunk's start/end key and
compare hashes.
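A minimal client-side sketch of the per-chunk digesting, assuming the HBase 2.x client API; ChunkDigest, digestChunks, and the row-count chunkSize are hypothetical illustrations, not actual PhoenixSyncTable code (the real tool would run this in a mapper and, as described below, server-side):
{code:java}
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

// Illustrative sketch only; names below are hypothetical, not PhoenixSyncTable APIs.

/** Boundary and MD5 of one chunk: [startKey, endKey). */
class ChunkDigest {
    final byte[] startKey;
    final byte[] endKey;
    final byte[] digest;

    ChunkDigest(byte[] startKey, byte[] endKey, byte[] digest) {
        this.startKey = startKey;
        this.endKey = endKey;
        this.digest = digest;
    }
}

class ChunkDigester {
    /** Splits one source region into row-count chunks and digests each chunk. */
    static List<ChunkDigest> digestChunks(Table table, byte[] regionStart,
                                          byte[] regionEnd, int chunkRowCount) throws Exception {
        List<ChunkDigest> chunks = new ArrayList<>();
        Scan scan = new Scan().withStartRow(regionStart).withStopRow(regionEnd);
        try (ResultScanner scanner = table.getScanner(scan)) {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] chunkStart = regionStart;
            int rowsInChunk = 0;
            for (Result row : scanner) {
                if (rowsInChunk == chunkRowCount) {
                    // This row starts the next chunk; its key is the exclusive
                    // end of the current one. digest() also resets the digest.
                    chunks.add(new ChunkDigest(chunkStart, row.getRow(), md.digest()));
                    chunkStart = row.getRow();
                    rowsInChunk = 0;
                }
                for (Cell cell : row.rawCells()) {
                    md.update(CellUtil.cloneRow(cell));
                    md.update(CellUtil.cloneValue(cell));
                }
                rowsInChunk++;
            }
            // Tail chunk runs to the end of the region.
            chunks.add(new ChunkDigest(chunkStart, regionEnd, md.digest()));
        }
        return chunks;
    }
}
{code}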
There will be cases where a source chunk boundary, e.g. [30,80), lies within one
region on the source but is not contained in a single region on the target, being
split across multiple regions instead, e.g. [30,40), [40,50), [50,80). In such cases
we cannot rely on directly comparing hashes between source and target, because the
hash would look like MD5(MD5(row30..row39) + MD5(row40..row49) + MD5(row50..row79))
on the target vs MD5(row30 + row31 + ... + row79) on the source, and checksums are
not associative in this sense: a hash of per-range hashes is not equal to the hash
of the concatenated rows.
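To make the mismatch concrete, a self-contained sketch (the row payloads are hypothetical stand-ins; only java.security.MessageDigest from the JDK is assumed) showing that the two digests differ:
{code:java}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class HashCompositionDemo {
    /** MD5 over the concatenation of the given byte arrays. */
    static byte[] md5(byte[]... parts) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (byte[] p : parts) {
            md.update(p);
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-ins for the raw rows of the three target
        // regions spanned by the source chunk [30,80).
        byte[] r1 = "row30..row39".getBytes(StandardCharsets.UTF_8);
        byte[] r2 = "row40..row49".getBytes(StandardCharsets.UTF_8);
        byte[] r3 = "row50..row79".getBytes(StandardCharsets.UTF_8);

        // Source: one digest over the whole chunk.
        byte[] sourceHash = md5(r1, r2, r3);
        // Target: digest of the three per-region digests.
        byte[] targetHash = md5(md5(r1), md5(r2), md5(r3));

        System.out.println(Arrays.equals(sourceHash, targetHash)); // prints false
    }
}
{code}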
To handle such cases:
- The mapper iterates over the source chunk until the chunk size is reached (or the
end of the source region), holds the rows on the server-side scanner, and determines
the source chunk boundary, e.g. [30,80).
- The mapper then scans the target using the source chunk boundary [30,80), identifies
that it has to issue a separate scan for each target region the chunk spans, e.g.
[30,40), [40,50), [50,80), and computes the associative hash
MD5(MD5(row30..row39) + MD5(row40..row49) + MD5(row50..row79)) on the target.
- The source is handed the target region boundaries for the chunk, which may be
multiple as above, i.e. [30,40), [40,50), [50,80); using the rows already held by the
source region scanner coprocessor, it computes the same associative hash, split at the
target region boundaries.
- Now we can compare the associative checksums between source and target without even
bringing the row data to the client (see the sketch after this list).
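A sketch of the boundary-aligned comparison, assuming both sides can digest an arbitrary sub-range; RangeDigester, associativeDigest, and chunkMatches are hypothetical names (on the source, the range digests would come from the coprocessor holding the already-scanned rows rather than from a client scan):
{code:java}
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names below are hypothetical, not PhoenixSyncTable APIs.

/** Digests all rows in [startKey, endKey) on one side (source or target). */
interface RangeDigester {
    byte[] digestRange(byte[] startKey, byte[] endKey) throws Exception;
}

class ChunkComparator {
    /**
     * MD5 over the per-sub-range MD5s, with the chunk cut at the target
     * region boundaries. cuts = [30, 40, 50, 80] encodes the sub-ranges
     * [30,40), [40,50), [50,80) of the chunk [30,80).
     */
    static byte[] associativeDigest(RangeDigester side, List<byte[]> cuts) throws Exception {
        MessageDigest outer = MessageDigest.getInstance("MD5");
        for (int i = 0; i + 1 < cuts.size(); i++) {
            outer.update(side.digestRange(cuts.get(i), cuts.get(i + 1)));
        }
        return outer.digest();
    }

    /** Both sides compose at identical boundaries, so the digests are comparable. */
    static boolean chunkMatches(RangeDigester source, RangeDigester target,
                                List<byte[]> targetCuts) throws Exception {
        return Arrays.equals(associativeDigest(source, targetCuts),
                             associativeDigest(target, targetCuts));
    }
}
{code}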
> Feature to validate table data using PhoenixSyncTable tool between source and
> target clusters
> ----------------------------------------------------------------------------------------
>
> Key: PHOENIX-7751
> URL: https://issues.apache.org/jira/browse/PHOENIX-7751
> Project: Phoenix
> Issue Type: Sub-task
> Affects Versions: 5.2.0, 5.2.1, 5.3.0
> Reporter: Rahul Kumar
> Assignee: Rahul Kumar
> Priority: Major
>