[ 
https://issues.apache.org/jira/browse/PHOENIX-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626137#comment-16626137
 ] 

Bin Shi commented on PHOENIX-4594:
----------------------------------

[~lhofhansl], this patch is NOT for "have a parallelism target and combine 
guideposts as needed". This patch is just a small optimization for replacing 
the linear search by binary search of the guideposts to find the intersection 
with the scan ranges. For example, assume we have guide posts "Region 0(gp0, 
gp1, gp2); Region 1(gp3, gp4, gp5);...; Region 100 (gp300, gp301, gp302); 
Region 101 (gp303, gp304, gp305); ..." and the scan range for a query "select * 
from table where pk >= x AND pk <= y" begins at gp301 in region 100 and ends at 
gp304 in region 101, this patch uses binary search to find gp300, the first 
guide post in region 100. Because I don't want to decode and load all the guide 
posts to memory then perform binary search, to reduce memory footprint and 
eliminate the cost of decoding/loading guide posts after gp304, I use a moving 
window in the change. The algorithm is commented in the code as below:

_Continuously decode and load guide posts in batches (moving window). For each 
moving window, firstly compare the searching key with the last element to see 
whether the searching key is in the current window. If it isn't, perform binary 
search in the window; otherwise, move to the next window and repeat the above 
steps until it finds the start key of the first region in the scan ranges or 
its insertion position._

As showed by my previous comments and internal discussions, I agree that "the 
main problem is – with the current guideposts is that the number of scans that 
are generated and executed by a query are locked with the number of guideposts 
that exist". More specifically, this problem can be split into the following 
two problems:
 # For query complexity estimation and query optimization, we need to go 
through all guide posts sequentially to collect estimated rows and size. The 
time complexity is O(n), where n is the total count of guide posts. When n is 
bigger, the performance gets worse. That is why I eventually want to change 
guide post info data structure to a SUM tree with leaf node being a chunk of 
guide posts in prefix encoding, so the time complexity can be reduced to 
O(logk), where k is the count of guide post chunks. The time complexity can 
even be reduced to O(1) for full scans.
 # The granularity of a scan is the same as that of a guide post. Today we 
don't have this problem, because we have disabled intra-region parallel scans 
so the granularity of a scan now is a region. To support intra-region parallel 
scans, we need what you said "disentangle the guidepost data from the actually 
materialized scans". And the version on Sept. 15th of my doc [Providing 
Multi-tiers Read Parallelization in Phoenix 
Stats|https://docs.google.com/document/d/1t0rIwqlkz62eP0FQ1MjjF_wYo2yQYHZ0I9qEqEf5tfU/edit#heading=h.s01zivx444h4]
 described the similar idea to what you said "As super simple solution would be 
to always group N guideposts together, where N might be configurable (perhaps 
per query or something)" – what I said is "Another alternative to this 
“proposal” is to define the minimal intra-region scan length which should be 
integral multiple of the defined GUIDE_POSTS_WIDTH value and still let the 
client side generate intra-region parallel scans as what today is. This 
approach is less aggressive and introduce much less changes into the current 
design and implementation." I have restored that version.

I have held this patch, because I want to use the right solutions to address 
the above two problems directly. 

> Perform binary search on guideposts during query compilation
> ------------------------------------------------------------
>
>                 Key: PHOENIX-4594
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4594
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: James Taylor
>            Assignee: Bin Shi
>            Priority: Major
>         Attachments: PHOENIX-4594-0913.patch, PHOENIX-4594_0917.patch, 
> PHOENIX-4594_0918.patch
>
>
> If there are many guideposts, performance will suffer during query 
> compilation because we do a linear search of the guideposts to find the 
> intersection with the scan ranges. Instead, in 
> BaseResultIterators.getParallelScans() we should populate an array of 
> guideposts and perform a binary search. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to