[ 
https://issues.apache.org/jira/browse/PHOENIX-5648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004300#comment-17004300
 ] 

Vincent Poon commented on PHOENIX-5648:
---------------------------------------

In a sense, this should already be "server-side".  IndexScrutiny is a 
MapReduce job in which the Mapper for each region should, in theory, be 
assigned to the regionserver/datanode hosting that region, which in turn 
should (assuming a major compaction has run, etc.) have locality with respect 
to the region's underlying HDFS blocks.

The Mapper, after reading a batch of local rows, then has to construct a query 
which checks the corresponding index rows.  Since the index values could be 
anything, this potentially results in a spray of reads across the cluster.  
Multiplied by many mappers, that can be burdensome.
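
To make the "spray of reads" concrete, here is a rough sketch, not Phoenix 
code.  It assumes an index table whose row key starts with the indexed value 
and which is split into regions at some made-up split points 
(INDEX_REGION_SPLITS, index_region_for and regions_touched are all 
hypothetical names).  It counts how many index regions a single batch of 
data rows would have to touch.

```python
import bisect
import random
from typing import Iterable, Set, Tuple

# Hypothetical split points of the index table, keyed on the indexed value.
# Seven split points give eight regions, numbered 0..7.
INDEX_REGION_SPLITS = [100, 200, 300, 400, 500, 600, 700]


def index_region_for(indexed_value: int) -> int:
    """Return the number of the index region whose key range holds the value.

    A region's start key is inclusive, so a value equal to a split point
    belongs to the region on the right of that split.
    """
    return bisect.bisect_right(INDEX_REGION_SPLITS, indexed_value)


def regions_touched(batch: Iterable[Tuple[int, int]]) -> Set[int]:
    """Return the set of index regions needed to verify a batch.

    Each batch element is (data_pk, indexed_value).
    """
    return {index_region_for(value) for _pk, value in batch}


# A batch of 64 data rows that are adjacent by primary key (so local to one
# data region), but whose indexed values are arbitrary.
rng = random.Random(0)
batch = [(pk, rng.randrange(800)) for pk in range(1000, 1064)]
print(f"{len(regions_touched(batch))} of 8 index regions touched by one batch")
```

The data rows are contiguous, yet the lookups they generate are not, and 
that is the part that does not stay local to the mapper.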

I always thought something like a Merkle tree could obviate the need to 
transfer that much data.  Hash clusters of rows on both the data and index 
tables, using hash function inputs that are common to both, and build up a 
tree on each side.  At the end, you need only compare the roots, drill down 
to the clusters that differ, and do the expensive row-by-row comparison 
there.  There are some complications with this, of course.
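
A minimal sketch of that idea, under several assumptions of mine: each table 
is modelled as an in-memory mapping from data primary key to indexed value, 
rows are clustered by hashing the primary key into a fixed number of buckets, 
and the row digests in a bucket are combined with XOR so that row order does 
not matter.  None of the names below (NUM_BUCKETS, leaf_hashes, build_tree, 
mismatched_buckets) exist in Phoenix.

```python
import hashlib
from typing import Dict, List

NUM_BUCKETS = 8  # hypothetical cluster count; must be a power of two here


def row_digest(pk: str, value: str) -> bytes:
    """Hash the inputs both tables can supply: data PK and indexed value."""
    return hashlib.sha256(f"{pk}\x00{value}".encode()).digest()


def bucket_of(pk: str) -> int:
    """Assign a row to a cluster using only its primary key."""
    head = hashlib.sha256(pk.encode()).digest()[:4]
    return int.from_bytes(head, "big") % NUM_BUCKETS


def leaf_hashes(rows: Dict[str, str]) -> List[bytes]:
    """Build one leaf per bucket by XOR-ing the digests of its rows.

    XOR is order-independent, which suits tables sorted differently, but
    it assumes each (pk, value) pair appears at most once per table.
    """
    leaves = [bytes(32)] * NUM_BUCKETS
    for pk, value in rows.items():
        b = bucket_of(pk)
        digest = row_digest(pk, value)
        leaves[b] = bytes(x ^ y for x, y in zip(leaves[b], digest))
    return leaves


def build_tree(leaves: List[bytes]) -> List[List[bytes]]:
    """Return all levels, leaves first; the root is levels[-1][0]."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([hashlib.sha256(prev[i] + prev[i + 1]).digest()
                       for i in range(0, len(prev), 2)])
    return levels


def mismatched_buckets(a: List[List[bytes]], b: List[List[bytes]]) -> List[int]:
    """Compare roots, then descend only into subtrees that differ."""
    suspects = [0]
    for level in range(len(a) - 1, 0, -1):
        differing = [i for i in suspects if a[level][i] != b[level][i]]
        suspects = [child for i in differing for child in (2 * i, 2 * i + 1)]
    return [i for i in suspects if a[0][i] != b[0][i]]


data = {f"row{i}": f"v{i}" for i in range(100)}
index = dict(data)
index["row42"] = "stale"  # simulate an index row that is out of sync

bad = mismatched_buckets(build_tree(leaf_hashes(data)),
                         build_tree(leaf_hashes(index)))
print("buckets needing row-by-row scrutiny:", bad)
```

Only the tree nodes would need to cross the network; the row-by-row check is 
then limited to the rows that hash into the reported buckets.  The sketch 
sidesteps the complications (concurrent writes, agreeing on cluster 
boundaries across differently sorted tables, and so on).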

> Improve IndexScrutinyTool's performance by moving comparison logic to server 
> side
> ---------------------------------------------------------------------------------
>
>                 Key: PHOENIX-5648
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5648
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 5.0.0, 4.15.0, 4.14.3
>            Reporter: Swaroopa Kadam
>            Assignee: Swaroopa Kadam
>            Priority: Minor
>             Fix For: 5.1.0, 4.15.1, 4.14.4, 4.16.0
>
>
> If IndexScrutinyTool runs on a table with a billion rows, it takes a long 
> time. 
> One way to improve the tool is to move the comparison to the server side. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
