[jira] [Commented] (PHOENIX-180) Use stats to guide query parallelization

James Taylor (JIRA) Mon, 25 Aug 2014 09:29:01 -0700

    [ 
https://issues.apache.org/jira/browse/PHOENIX-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109270#comment-14109270
 ]


James Taylor commented on PHOENIX-180:
--------------------------------------

bq. Do we have provision to do that from the scripts itself?
You can either add this to the performance.py script or bring up sqlline and 
issue the query.

bq. Would like to know on what basis should guide posts be collected?
Collect a guide post after every n bytes of KVs, where n is configurable. The
other configurable 
parameter would be how often the MetaDataEndpointImpl would re-query the stats 
table. Also, given that we have a way of analyzing stats manually, we should 
make sure that this also invalidates the metadata cache entry (just for that 
table). We already have a way of clearing the entire cache; we just need a new 
method to invalidate the cache entry for a single table.
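
As a rough illustration of the "collect after n bytes" rule, here is a minimal sketch. The class name GuidePostCollector, its methods, and the use of String row keys (HBase row keys are really byte arrays) are all assumptions made for illustration; none of this is taken from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: records a guide post key each time roughly
 * n bytes of key-values have been seen since the previous guide post.
 */
class GuidePostCollector {
    private final long guidePostWidthBytes; // the configurable "n"
    private final List<String> guidePosts = new ArrayList<>();
    private long bytesSinceLastGuidePost = 0;

    GuidePostCollector(long guidePostWidthBytes) {
        if (guidePostWidthBytes <= 0) {
            throw new IllegalArgumentException("guide post width must be positive");
        }
        this.guidePostWidthBytes = guidePostWidthBytes;
    }

    /** Call once per key-value, in row key order. */
    void add(String rowKey, long kvSizeBytes) {
        bytesSinceLastGuidePost += kvSizeBytes;
        if (bytesSinceLastGuidePost >= guidePostWidthBytes) {
            guidePosts.add(rowKey);      // this key becomes a guide post
            bytesSinceLastGuidePost = 0; // start counting toward the next one
        }
    }

    /** Returns a copy of the guide posts collected so far, in key order. */
    List<String> getGuidePosts() {
        return new ArrayList<>(guidePosts);
    }
}
```

A smaller n gives more guide posts and finer-grained parallelization, at the cost of a larger stats row.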

FWIW, if creating 10M rows takes too long, you can probably dial that down to 
3M and hopefully still see a perf difference.
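
On the stats cache point above, the re-query interval and the single-table invalidation could look something like the following sketch. TableStatsCache and all of its methods are hypothetical names, and the stats payload is reduced to a String; the real metadata cache in MetaDataEndpointImpl holds PTable instances and may be structured quite differently.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch of a per-table stats cache with a re-query interval. */
class TableStatsCache {
    private static final class Entry {
        final String stats;
        final long loadedAtMillis;

        Entry(String stats, long loadedAtMillis) {
            this.stats = stats;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final ConcurrentMap<String, Entry> entries = new ConcurrentHashMap<>();
    private final long requeryIntervalMillis; // configurable re-query interval

    TableStatsCache(long requeryIntervalMillis) {
        this.requeryIntervalMillis = requeryIntervalMillis;
    }

    /**
     * Returns the cached stats, or null if there is no entry or the entry is
     * older than the re-query interval (the caller then re-reads the stats table).
     */
    String get(String tableName, long nowMillis) {
        Entry e = entries.get(tableName);
        if (e == null || nowMillis - e.loadedAtMillis >= requeryIntervalMillis) {
            return null;
        }
        return e.stats;
    }

    void put(String tableName, String stats, long nowMillis) {
        entries.put(tableName, new Entry(stats, nowMillis));
    }

    /** Invalidates one table only, e.g. after stats are analyzed manually. */
    void invalidate(String tableName) {
        entries.remove(tableName);
    }

    /** Clears every entry (the behavior that already exists today). */
    void clear() {
        entries.clear();
    }
}
```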

> Use stats to guide query parallelization
> ----------------------------------------
>
>                 Key: PHOENIX-180
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-180
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: ramkrishna.s.vasudevan
>              Labels: enhancement
>         Attachments: Phoenix-180_WIP.patch
>
>
> We're currently not using stats, beyond a table-wide min key/max key cached 
> per client connection, to guide parallelization. If a query targets just a 
> few regions, we don't know how to evenly divide the work among threads, 
> because we don't know the data distribution. This other [issue] 
> (https://github.com/forcedotcom/phoenix/issues/64) is targeting gathering and 
> maintaining the stats, while this issue is focused on using the stats.
> The main changes are:
> 1. Create a PTableStats interface that encapsulates the stats information 
> (and implements the Writable interface so that it can be serialized back from 
> the server).
> 2. Add a stats member variable off of PTable to hold this.
> 3. From MetaDataEndpointImpl, look up the stats row for the table in the stats 
> table. If the stats have changed, return a new PTable with the updated stats 
> information. We may want to cache the stats row and have the stats gatherer 
> invalidate the cache row when updated so we don't have to always do a scan 
> for it. Additionally, it would be ideal if we could use the same split policy 
> on the stats table that we use on the system table to guarantee co-location 
> of data (for the sake of caching).
> 4. Modify the client-side parallelization (ParallelIterators.getSplits()) to 
> use this information to guide how to chunk up the scans at query time.
> This should help boost query performance, especially in cases where the data 
> is highly skewed. It's likely the cause for the slowness reported in this 
> issue: https://github.com/forcedotcom/phoenix/issues/47.
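
To make the client-side change in the description above more concrete, here is a minimal sketch of chunking a scan range at guide posts. GuidePostSplitter, KeyRange, and split() are hypothetical names, and String keys stand in for byte-array row keys; this is not the actual ParallelIterators.getSplits() logic, only one way the idea could work.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: chunk a scan's key range at the guide posts inside it. */
class GuidePostSplitter {
    /** Half-open key range [start, end). */
    static final class KeyRange {
        final String start;
        final String end;

        KeyRange(String start, String end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /**
     * Splits [start, end) at every guide post that falls strictly inside it.
     * The guide posts must be sorted in ascending key order.
     */
    static List<KeyRange> split(String start, String end, List<String> sortedGuidePosts) {
        List<KeyRange> chunks = new ArrayList<>();
        String chunkStart = start;
        for (String guidePost : sortedGuidePosts) {
            if (guidePost.compareTo(chunkStart) <= 0) {
                continue; // at or before the current chunk start
            }
            if (guidePost.compareTo(end) >= 0) {
                break; // past the end of the scan range
            }
            chunks.add(new KeyRange(chunkStart, guidePost));
            chunkStart = guidePost;
        }
        chunks.add(new KeyRange(chunkStart, end)); // final (or only) chunk
        return chunks;
    }
}
```

Because guide posts are spaced by bytes rather than by key distance, each chunk covers a similar amount of data even when the keys are heavily skewed, and each chunk can then be handed to its own thread.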



--
This message was sent by Atlassian JIRA
(v6.2#6252)
