Rahul Kumar created PHOENIX-7751:
------------------------------------
Summary: Feature to validate table data using PhoenixSyncTable
tool b/w source and target cluster
Key: PHOENIX-7751
URL: https://issues.apache.org/jira/browse/PHOENIX-7751
Project: Phoenix
Issue Type: Sub-task
Affects Versions: 5.3.0, 5.2.1, 5.2.0
Reporter: Rahul Kumar
Assignee: Rahul Kumar
Start by implementing a tool that, as a first step, only validates data between
the source and target cluster for a given table.
Generate mappers based on region boundaries on the source cluster. Each mapper
further splits its region range into chunks of a pre-defined chunkSize
(bytes/rows), then fetches the matching target chunk using the source chunk's
start/end key and compares hashes.
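
As a rough illustration of the chunking step (not the tool's actual code), the
sketch below splits a sorted stream of rows into chunks of at most chunk_size
rows and reports each chunk's first and last key. The Row shape, the name
chunk_rows and the row-count-based limit are assumptions made for the example;
the real tool may chunk by bytes instead.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical row shape for this sketch: (row key, serialized row value).
Row = Tuple[bytes, bytes]


def chunk_rows(
    rows: Iterable[Row], chunk_size: int
) -> Iterator[Tuple[bytes, bytes, List[Row]]]:
    """Yield (first_key, last_key, rows) for consecutive chunks.

    Each chunk holds at most chunk_size rows. The final chunk may be
    shorter when the end of the source region is reached first.
    """
    chunk: List[Row] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk[0][0], chunk[-1][0], chunk
            chunk = []  # start a fresh list; the yielded one is left intact
    if chunk:
        # End of region reached before the chunk filled up.
        yield chunk[0][0], chunk[-1][0], chunk


# Usage: five rows split into chunks of two rows each.
example_rows = [(bytes([i]), b"v") for i in range(5)]
example_chunks = list(chunk_rows(example_rows, 2))
```

The chunk's key range is what would then be used to scan the same range on the
target cluster.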
There will be cases where a source chunk lies within one region, say [30,80),
but is not contained in a single region on the target and is instead split
across multiple regions, e.g. [30,40), [40,50), [50,80). In such cases we cannot
directly compare hashes across source and target, because the hashes would look
like
MD5(MD5(row1...row30) + MD5(row31...row60) + MD5(row61...rowN)) on target vs
MD5(row1 + row2 + ... + rowN) on source, and checksums are not associative: a
hash of per-region hashes does not equal a hash of all the rows.
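
The mismatch is easy to reproduce with Python's hashlib. This is only a
demonstration of the property, with made-up row contents and a made-up 30/30/30
split; flat_hash and nested_hash are names invented for the example.

```python
import hashlib


def flat_hash(rows):
    """MD5(row1 + row2 + ... + rowN): one digest over all rows."""
    h = hashlib.md5()
    for row in rows:
        h.update(row)
    return h.hexdigest()


def nested_hash(groups):
    """MD5(MD5(group1) + MD5(group2) + ...): a digest of per-group digests."""
    outer = hashlib.md5()
    for group in groups:
        inner = hashlib.md5()
        for row in group:
            inner.update(row)
        outer.update(inner.digest())
    return outer.hexdigest()


# Identical data on both sides, hashed in the two different ways.
rows = [f"row{i}".encode() for i in range(1, 91)]
source_digest = flat_hash(rows)
target_digest = nested_hash([rows[:30], rows[30:60], rows[60:]])

# The data is the same, yet the digests disagree.
digests_match = source_digest == target_digest
```

Here digests_match is False even though both sides hold the same rows, so a
naive comparison would report a false mismatch.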
To handle such cases:
- The mapper iterates over the source region until the chunk size is reached (or
the end of the source region is reached), holds the rows on the server-side
scanner and obtains the source chunk boundary, e.g. [30,80).
- The mapper then scans the target for the source chunk boundary [30,80) and
identifies that it has to issue one scan per target region covering the chunk,
e.g. [30,40), [40,50), [50,80). The mapper computes the per-region
("associative") hash
MD5(MD5(row1...row30) + MD5(row31...row60) + MD5(row61...rowN)) on the target,
rather than the flat MD5(row1 + row2 + ... + rowN).
- The source is given the target region boundaries for the chunk, which may be
several as above, i.e. [30,40), [40,50), [50,80). Using the rows already held by
the source region scanner coproc, it computes the same per-region hash as the
target, based on the target region boundaries for the chunk.
- Now we can compare the hashes between source and target without fetching the
data to the client.
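
A minimal sketch of the boundary-aligned comparison described in the steps
above, assuming rows are held in memory as sorted (key, value) pairs.
range_digest, boundary_hash and the zero-padded string keys are hypothetical;
in the proposed design this work would happen in the region scanner
coprocessor, not in client code.

```python
import hashlib
from bisect import bisect_left


def range_digest(rows, start, end):
    """MD5 over the rows whose key falls in [start, end); rows sorted by key."""
    keys = [key for key, _ in rows]
    lo, hi = bisect_left(keys, start), bisect_left(keys, end)
    h = hashlib.md5()
    for key, value in rows[lo:hi]:
        h.update(key)
        h.update(value)
    return h.digest()


def boundary_hash(rows, boundaries):
    """MD5 of the per-range digests, taken in boundary order."""
    outer = hashlib.md5()
    for start, end in boundaries:
        outer.update(range_digest(rows, start, end))
    return outer.hexdigest()


# Source chunk [30,80) held on the source scanner; the target has the same data.
source_rows = [(f"{k:03d}".encode(), f"v{k}".encode()) for k in range(30, 80)]
target_rows = list(source_rows)

# Target region boundaries that cover the source chunk.
target_boundaries = [(b"030", b"040"), (b"040", b"050"), (b"050", b"080")]

# Both sides hash along the same (target) boundaries, so equal data matches.
chunk_matches = (
    boundary_hash(source_rows, target_boundaries)
    == boundary_hash(target_rows, target_boundaries)
)
```

Because both sides group rows by the same target boundaries, only the final
digests need to be exchanged and compared.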
--
This message was sent by Atlassian Jira
(v8.20.10#820010)