[ 
https://issues.apache.org/jira/browse/PHOENIX-2982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332944#comment-15332944
 ] 

Junegunn Choi commented on PHOENIX-2982:
----------------------------------------

[~jamestaylor] Thanks for the comment.

bq. Are you in need of this kind of balancing due to usage of our Phoenix 
Query Server?

No, we're not yet considering Query Server. We've been running an HBase cluster 
for "semi-real-time" analytics that is expected to answer aggregation queries 
over a few months of data within several seconds. All the business logic is 
written in Java as Endpoint coprocessors, and I'm currently evaluating Phoenix 
to see if it can effectively replace those hand-crafted coprocessors. One thing 
I noticed during the test is the problem described above (suboptimal resource 
utilization and handler exhaustion), which led me to this patch.

bq. Local indexes are very similar to salted tables - any thoughts around using 
the same technique for those?

I'm not familiar with that part of the code, but a cursory examination suggests 
the patch will also affect scans over local indexes, since they share 
{{ParallelIterators}}. I'll see if I can get some performance numbers.

bq. we can setup our own RpcSchedulerFactory and RpcScheduler ... I'm curious 
if you've thought about this angle at all

The logic here determines the order in which the client submits scans to the 
regionservers, i.e. it tries not to put more load on already-busy regionservers 
(technically, salt buckets). An RpcScheduler is local to a regionserver and 
handles requests that have already been submitted to that server, so I believe 
it addresses a different class of problems. Please correct me if I'm wrong.

This patch does not change the round-robin processing of concurrent queries. It 
introduces another level of grouping inside a single query on a salted table. 
Let me try to delineate the conceptual hierarchy. Currently we use 
{{AbstractRoundRobinQueue}} to pick queries in round-robin fashion:

- AbstractRoundRobinQueue
-- Query 1
--- LinkedList of scans
-- Query 2
--- LinkedList of scans
-- ..

With the patch it becomes:

- AbstractRoundRobinQueue
-- Query 1
--- Salt bucket 1
---- LinkedList of scans
--- Salt bucket 2
---- LinkedList of scans
--- ...
-- Query 2
--- ...

A query is still picked in round-robin fashion as before (Q1 -> Q2 -> ...), but 
then we pick a scan for that query from the least-loaded salt bucket.
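To make the two-level selection concrete, here is a minimal standalone sketch, not the actual patch: queries are visited round-robin, and within the chosen query a scan is taken from the salt bucket with the fewest active scans. The class and method names ({{SaltBucketQueue}}, {{QueryScans}}, {{pickNext}}) are illustrative, not Phoenix code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of the two-level scheduling hierarchy described above.
class SaltBucketQueue {

    /** Per-query state: one pending-scan list per salt bucket plus an active counter. */
    static class QueryScans {
        final List<Queue<String>> buckets;  // pending scans, one queue per salt bucket
        final int[] active;                 // currently running scans per salt bucket

        QueryScans(int saltBuckets) {
            buckets = new ArrayList<>();
            for (int i = 0; i < saltBuckets; i++) {
                buckets.add(new ArrayDeque<>());
            }
            active = new int[saltBuckets];
        }

        /** Pick a pending scan from the least-loaded non-empty bucket. */
        String pickNext() {
            int best = -1;
            for (int i = 0; i < active.length; i++) {
                if (buckets.get(i).isEmpty()) continue;
                if (best < 0 || active[i] < active[best]) best = i;
            }
            if (best < 0) return null;      // no pending scans for this query
            active[best]++;                 // decremented when the scan completes
            return buckets.get(best).poll();
        }

        void scanFinished(int bucket) {
            active[bucket]--;
        }
    }

    private final List<QueryScans> queries = new ArrayList<>();
    private int cursor = 0;                 // round-robin position over queries

    void addQuery(QueryScans q) {
        queries.add(q);
    }

    /** Round-robin over queries; least-loaded salt bucket within the chosen query. */
    String next() {
        for (int tried = 0; tried < queries.size(); tried++) {
            QueryScans q = queries.get(cursor);
            cursor = (cursor + 1) % queries.size();
            String scan = q.pickNext();
            if (scan != null) return scan;
        }
        return null;                        // nothing pending in any query
    }
}
```

With a single query holding two scans in bucket 0 and one in bucket 1, the picks alternate between the buckets (bucket 0, then bucket 1 because bucket 0 now has an active scan, then bucket 0 again) rather than draining one bucket first.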

> Keep the number of active handlers balanced across salt buckets
> ---------------------------------------------------------------
>
>                 Key: PHOENIX-2982
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2982
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Junegunn Choi
>            Assignee: Junegunn Choi
>         Attachments: PHOENIX-2982-v2.patch, PHOENIX-2982.patch, 
> cpu-util-with-patch.png, cpu-util-without-patch.png
>
>
> I'd like to discuss the idea of keeping the numbers of active handlers 
> balanced across the salt buckets during parallel scan by exposing the 
> counters to JobManager queue.
> h4. Background
> I was testing Phoenix on a 10-node test cluster. When I was running a few 
> full-scan queries simultaneously on a table whose {{SALT_BUCKETS}} is 100, I 
> noticed small queries such as {{SELECT * FROM T LIMIT 100}} are occasionally 
> blocked for up to tens of seconds due to the exhaustion of regionserver 
> handler threads.
> {{hbase.regionserver.handler.count}} was set to 100, which is much larger 
> than the default 30, so I didn't expect this to happen 
> ({{phoenix.query.threadPoolSize}} = 128 / 10 nodes * 4 queries =~ 50 < 100), 
> but from periodic thread dumps, I could observe that the numbers of active 
> handlers across the regionservers are often skewed during the execution.
> {noformat}
> # Obtained by periodic thread dumps
> # (key = regionserver ID, value = number of active handlers)
> 17:23:48: {1=>6,  2=>3,  3=>27, 4=>12, 5=>13, 6=>23, 7=>23, 8=>5,  9=>5,  10=>10}
> 17:24:18: {1=>8,  2=>6,  3=>26, 4=>3,  5=>13, 6=>41, 7=>11, 8=>5,  9=>8,  10=>5}
> 17:24:48: {1=>15, 3=>30, 4=>3,  5=>8,  6=>22, 7=>11, 8=>16, 9=>16, 10=>7}
> 17:25:18: {1=>6,  2=>12, 3=>37, 4=>6,  5=>4,  6=>2,  7=>21, 8=>10, 9=>24, 10=>5}
> 17:25:48: {1=>4,  2=>9,  3=>48, 4=>14, 5=>2,  6=>7,  7=>18, 8=>16, 9=>2,  10=>8}
> {noformat}
> Although {{ParallelIterators.submitWork}} shuffles the parallel scan tasks 
> before submitting them to the {{ThreadPoolExecutor}}, there's currently no 
> mechanism to prevent the skew from developing at runtime.
> h4. Suggestion
> Maintain "active" counter for each salt bucket, and expose the numbers to 
> JobManager queue via specialized {{Producer}} implementation so that it can 
> choose a scan for the least loaded bucket.
> By doing so we can prevent the handler exhaustion problem described above and 
> can expect more consistent utilization of the resources.
> h4. Evaluation
> I measured the response times of a full-scan query with and without the 
> patch. The plan for the query is given as follows:
> {noformat}
> > explain select count(*) from image;
> +----------------------------------------------------------------------------------------------+
> |                                             PLAN                                             |
> +----------------------------------------------------------------------------------------------+
> | CLIENT 2700-CHUNK 1395337221 ROWS 817889379319 BYTES PARALLEL 100-WAY FULL SCAN OVER IMAGE   |
> |     SERVER FILTER BY FIRST KEY ONLY                                                          |
> |     SERVER AGGREGATE INTO SINGLE ROW                                                         |
> +----------------------------------------------------------------------------------------------+
> {noformat}
> Result:
> - Without patch: 260 sec
> - With patch: 220 sec
> And with the patch, CPU utilization across the regionservers is very stable 
> during the run.
> I'd like to hear what you think about the approach. Please let me know if 
> there are any concerns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
