[ 
https://issues.apache.org/jira/browse/PHOENIX-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607374#comment-16607374
 ] 

Bin Shi commented on PHOENIX-4594:
----------------------------------

It seems that we have more issues in Phoenix Stats.
 # In BaseResultIterators.getParallelScans(...), we use a linear search over 
the guide posts to find where the intersection begins; this is the issue 
described in this Jira and the previous comments. The solution "make a pass 
through all guideposts and put them into a List in which we can perform a 
binary search" has one downside (besides the memory cost): we could do 
unnecessary work decoding guide posts that lie beyond the Scan Range, whereas 
the original code doesn't have this problem. If this unnecessary work is 
expensive in some cases, we could decode guide posts in batches and use binary 
search within a moving window to find where the intersection begins. I won't go 
that far for the time being, because I haven't seen a case where it is useful 
when GUIDE_POST_WIDTH is below 100MB.
 # In BaseResultIterators.getParallelScans(...), in all cases (serial plan 
or not, point lookup or not, explain plan or a real query executed on the 
server side, useStatsForParallelization flag turned on or off), we always go 
through the guide posts, collect estimates (of row count and size), and create 
scans based on the guide posts. We should deliberately differentiate among 
these cases. For example, for a point lookup it isn't necessary to collect 
estimates from the guide posts, because they are never used afterwards, and 
generating scans at the region level is enough. As another example, if 
useStatsForParallelization is turned off, we may still collect estimates from 
the guide posts, but generating scans at the region level is enough. 
 # Regarding the overall Phoenix Stats design: while I do think Phoenix Stats 
is useful for Query Complexity Estimation and Query Optimization, according to 
[Statistics Collection|https://phoenix.apache.org/update_statistics.html] (see 
the “parallelization” section) on the Phoenix website, Phoenix Stats is 
designed to provide a means of gaining intra-region parallelization (thus 
increasing performance and reducing query latency). *I doubt that Phoenix 
Stats is a good design for achieving this goal, but I may be missing some 
context, in which case this is a misunderstanding on my part.*
 
In my understanding, the current design works well only when a region server is 
processing one or a few queries and the overall load is light, so that 
increasing the parallelization inside a query, by assigning each chunk of data 
between guideposts to a thread/handler in a separate scan, helps performance. 
When the overall load on a region server is high, due to high resource 
consumption or multiple queries being processed on the server, increasing the 
parallelization inside a query could lead to higher system overhead from 
context switching and L1 cache misses as we increase the number of 
threads/handlers on the region server (which should still be bounded by the 
number of CPU cores), or to unpredictable latency when the threads/handlers are 
saturated and some pieces of the scans have to wait for free threads/handlers, 
and eventually hurt performance instead of improving it.

According to the above, whether or not the current design works well depends on 
the access pattern of the load and the scenario, but processing multiple 
queries and having high or varying load on region servers should be very 
common.
 # Because of 3., in Phoenix Stats we might need to achieve a high degree of 
parallelization at two levels: the client side, where compilation and query 
optimization happen, decides on parallel scans at the region level, and each 
individual region server decides on intra-region parallelization based on its 
current load and the guide post info.
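
To make the "decode in batches and binary search within a moving window" idea 
from item 1 concrete, here is a minimal sketch. It is not Phoenix code: the 
class GuidePostSearch, the methods firstAtOrAfter and lowerBound, and the use 
of an Iterator<byte[]> as a stand-in for the guide post decoder are all 
hypothetical names chosen for illustration. It assumes guide posts arrive in 
ascending unsigned byte order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch, not part of Phoenix.
final class GuidePostSearch {

    /**
     * Returns the index of the first guide post >= startKey, or the total
     * number of guide posts if there is none. The iterator stands in for a
     * decoder that yields guide posts in ascending order; decoding stops
     * after the batch that contains the answer.
     */
    static int firstAtOrAfter(Iterator<byte[]> decoder, byte[] startKey, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        int base = 0; // number of guide posts in the batches already skipped
        List<byte[]> batch = new ArrayList<>(batchSize);
        while (decoder.hasNext()) {
            batch.clear();
            while (batch.size() < batchSize && decoder.hasNext()) {
                batch.add(decoder.next());
            }
            byte[] last = batch.get(batch.size() - 1);
            // Only search this window if its largest key reaches startKey.
            if (Arrays.compareUnsigned(last, startKey) >= 0) {
                return base + lowerBound(batch, startKey);
            }
            base += batch.size();
        }
        return base;
    }

    /** Classic lower-bound binary search over one decoded batch. */
    static int lowerBound(List<byte[]> sorted, byte[] key) {
        int lo = 0;
        int hi = sorted.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (Arrays.compareUnsigned(sorted.get(mid), key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }
}
```

With this shape, guide posts after the window that contains the start key are 
never decoded by the search itself, which is the cost the full "decode 
everything into a List" approach would pay. The batch size trades memory 
against the number of window checks.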

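The case split suggested in item 2 could look roughly like the sketch below. 
It only encodes the two examples given there (point lookup, and 
useStatsForParallelization turned off); the class ScanPlanning, the enum 
Granularity, the Decision holder and the method decide are hypothetical and do 
not exist in Phoenix, and other cases such as serial plans or explain plans 
are deliberately left out.

```java
// Hypothetical sketch, not part of Phoenix.
final class ScanPlanning {

    /** The level at which parallel scans are generated. */
    enum Granularity { REGION, GUIDE_POST }

    /** What getParallelScans-style planning would need to do for a query. */
    static final class Decision {
        final Granularity granularity;
        final boolean collectEstimates;

        Decision(Granularity granularity, boolean collectEstimates) {
            this.granularity = granularity;
            this.collectEstimates = collectEstimates;
        }
    }

    static Decision decide(boolean isPointLookup, boolean useStatsForParallelization) {
        if (isPointLookup) {
            // Estimates are never used afterwards; region-level scans suffice.
            return new Decision(Granularity.REGION, false);
        }
        if (!useStatsForParallelization) {
            // Estimates may still be collected, but scans stay at region level.
            return new Decision(Granularity.REGION, true);
        }
        // Default: split scans on guide posts and collect estimates.
        return new Decision(Granularity.GUIDE_POST, true);
    }
}
```

Keeping the decision separate from the guide post traversal would let the 
cheaper paths skip the traversal entirely when neither estimates nor 
guide-post-level scans are needed.
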
> Perform binary search on guideposts during query compilation
> ------------------------------------------------------------
>
>                 Key: PHOENIX-4594
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4594
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: James Taylor
>            Assignee: Abhishek Singh Chouhan
>            Priority: Major
>
> If there are many guideposts, performance will suffer during query 
> compilation because we do a linear search of the guideposts to find the 
> intersection with the scan ranges. Instead, in 
> BaseResultIterators.getParallelScans() we should populate an array of 
> guideposts and perform a binary search. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
