[
https://issues.apache.org/jira/browse/PHOENIX-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004300#comment-17004300
]
Vincent Poon commented on PHOENIX-5648:
---------------------------------------
In a sense, this should already be "server-side". The IndexScrutiny is a
MapReduce job, where the Mapper for each region should in theory get assigned
to the regionserver/datanode hosting that region, which in turn should
(assuming major compaction has run, etc.) have locality with respect to the
region's underlying HDFS blocks.
The Mapper, after reading a batch of local rows, then has to construct a query
which checks the corresponding index rows. Since the index values could be
anything, this potentially results in a spray of reads across the cluster.
Multiplied by many mappers, that can be burdensome.
I always thought something like a Merkle tree could obviate the need to
transfer that much data. Hash clusters of rows on both data and index tables
using common hash function inputs, and build up a tree. At the end, you need
only compare the roots and then drill down to the cluster with issues, and do
the expensive row-by-row comparison there. There are some complications with
this, of course.
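
To make the idea concrete, here is a rough sketch in plain Python, not Phoenix
code. Everything in it is an assumption for illustration: rows are modeled as
(primary key, indexed value) pairs, rows are assigned to buckets by hashing the
data table primary key (which both tables carry), and per-bucket hashes are
combined with XOR so the result does not depend on scan order, since the index
table is sorted differently from the data table. All function names are
hypothetical.

```python
import hashlib

def row_digest(pk, indexed_value):
    # Hash the fields that must agree between data row and index row.
    return hashlib.sha256(repr((pk, indexed_value)).encode("utf-8")).digest()

def bucket_of(pk, num_buckets):
    # Both sides bucket by the data table primary key, so a data row and
    # its index row always land in the same bucket.
    d = hashlib.sha256(repr(pk).encode("utf-8")).digest()
    return int.from_bytes(d[:8], "big") % num_buckets

def leaf_hashes(rows, num_buckets):
    # XOR of row digests is order-independent. This assumes primary keys
    # are unique, otherwise duplicate rows would cancel each other out.
    leaves = [0] * num_buckets
    for pk, value in rows:
        leaves[bucket_of(pk, num_buckets)] ^= int.from_bytes(
            row_digest(pk, value), "big")
    return [x.to_bytes(32, "big") for x in leaves]

def build_tree(leaves):
    # levels[0] holds the leaves, levels[-1] holds the single root.
    n = len(leaves)
    if n < 1 or n & (n - 1):
        raise ValueError("number of buckets must be a power of two")
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([hashlib.sha256(prev[i] + prev[i + 1]).digest()
                       for i in range(0, len(prev), 2)])
    return levels

def differing_buckets(tree_a, tree_b):
    # Compare roots first, then descend only into subtrees that differ.
    top = len(tree_a) - 1
    if tree_a[top][0] == tree_b[top][0]:
        return []
    suspects = [0]
    for level in range(top - 1, -1, -1):
        suspects = [child
                    for i in suspects
                    for child in (2 * i, 2 * i + 1)
                    if tree_a[level][child] != tree_b[level][child]]
    return suspects
```

Only the buckets returned by differing_buckets would need the expensive
row-by-row comparison. The sketch ignores the complications mentioned above,
for example writes arriving while the trees are being built.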
> Improve IndexScrutinyTool's performance by moving comparison logic to server
> side
> ---------------------------------------------------------------------------------
>
> Key: PHOENIX-5648
> URL: https://issues.apache.org/jira/browse/PHOENIX-5648
> Project: Phoenix
> Issue Type: Improvement
> Affects Versions: 5.0.0, 4.15.0, 4.14.3
> Reporter: Swaroopa Kadam
> Assignee: Swaroopa Kadam
> Priority: Minor
> Fix For: 5.1.0, 4.15.1, 4.14.4, 4.16.0
>
>
> If IndexScrutinyTool runs on a table with a billion rows, it takes a lot of
> time.
> One of the ways to improve the tool is to move the comparison to the
> server-side.