[ 
https://issues.apache.org/jira/browse/PHOENIX-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141090#comment-14141090
 ] 

James Taylor commented on PHOENIX-180:
--------------------------------------

[~ramkrishna] discovered that the regions passed in aren't always consecutive, 
because our skip scan intersect code filters out regions that can't possibly 
contain data. Based on that, here's a slightly revised version:

{code}
        // Merge guideposts with region boundaries, starting each region at its
        // own start key, since the regions passed in may not be consecutive.
        while (regionIndex < regionSize) {
            byte[] currentGuidePost;
            currentKey = regions.get(regionIndex).getRegionInfo().getStartKey();
            endKey = regions.get(regionIndex++).getRegionInfo().getEndKey();
            // Emit a range for each guidepost that falls within this region
            // (an empty endKey means this is the last region of the table).
            while (guideIndex < gpsSize
                    && (Bytes.compareTo(currentGuidePost = gps.get(guideIndex), endKey) <= 0
                            || endKey.length == 0)) {
                KeyRange keyRange = KeyRange.getKeyRange(currentKey, currentGuidePost);
                if (keyRange != KeyRange.EMPTY_RANGE) {
                    guidePosts.add(keyRange);
                }
                currentKey = currentGuidePost;
                guideIndex++;
            }
            // Close out the region with a range from the last guidepost (or the
            // region start key) up to the region end key.
            KeyRange keyRange = KeyRange.getKeyRange(currentKey, endKey);
            if (keyRange != KeyRange.EMPTY_RANGE) {
                guidePosts.add(keyRange);
            }
            currentKey = endKey;
        }
        if (logger.isDebugEnabled()) {
            logger.debug("The captured guideposts are: " + guidePosts);
        }
        return guidePosts;
{code}
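
To make the chunking concrete, here's a standalone sketch of the same merge logic 
using string keys in place of byte[] row keys (no Phoenix/HBase dependencies; the 
class and helper names are made up for illustration). With region boundaries 
[a, m) and [m, <end>) and guideposts d, h, p, it produces the chunks 
[a, d), [d, h), [h, m), [m, p), [p, <end>):

{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative only: string keys stand in for byte[] row keys, and an empty
// end key means "unbounded", as in the snippet above.
public class GuidePostMergeSketch {
    static List<String[]> merge(List<String[]> regions, List<String> gps) {
        List<String[]> chunks = new ArrayList<>();
        int guideIndex = 0;
        for (String[] region : regions) {
            String currentKey = region[0];   // region start key
            String endKey = region[1];       // region end key ("" = unbounded)
            // Emit a chunk per guidepost that falls within this region.
            while (guideIndex < gps.size()
                    && (endKey.isEmpty() || gps.get(guideIndex).compareTo(endKey) <= 0)) {
                String gp = gps.get(guideIndex++);
                if (!gp.equals(currentKey)) {
                    chunks.add(new String[] { currentKey, gp });
                }
                currentKey = gp;
            }
            // Close out the region up to its end key.
            if (!currentKey.equals(endKey)) {
                chunks.add(new String[] { currentKey, endKey });
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String[]> regions = List.of(new String[] { "a", "m" }, new String[] { "m", "" });
        List<String> gps = List.of("d", "h", "p");
        for (String[] c : merge(regions, gps)) {
            System.out.println("[" + c[0] + ", " + (c[1].isEmpty() ? "<end>" : c[1]) + ")");
        }
    }
}
{code}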


> Use stats to guide query parallelization
> ----------------------------------------
>
>                 Key: PHOENIX-180
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-180
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: ramkrishna.s.vasudevan
>              Labels: enhancement
>         Attachments: Phoenix-180_V1.patch, Phoenix-180_V2.patch, 
> Phoenix-180_WIP.patch, Phoenix-180_v3.patch
>
>
> We're currently not using stats, beyond a table-wide min key/max key cached 
> per client connection, to guide parallelization. If a query targets just a 
> few regions, we don't know how to evenly divide the work among threads, 
> because we don't know the data distribution. This other issue 
> (https://github.com/forcedotcom/phoenix/issues/64) targets gathering and 
> maintaining the stats, while this issue is focused on using them.
> The main changes are:
> 1. Create a PTableStats interface that encapsulates the stats information 
> (and implements the Writable interface so that it can be serialized back from 
> the server). A rough sketch follows this list.
> 2. Add a stats member variable off of PTable to hold this.
> 3. From MetaDataEndPointImpl, look up the stats row for the table in the stats 
> table. If the stats have changed, return a new PTable with the updated stats 
> information. We may want to cache the stats row and have the stats gatherer 
> invalidate the cached row when it's updated so we don't always have to do a 
> scan for it. Additionally, it would be ideal if we could use the same split 
> policy on the stats table that we use on the system table to guarantee 
> co-location of data (for the sake of caching).
> 4. Modify the client-side parallelization (ParallelIterators.getSplits()) to 
> use this information to guide how to chunk up the scans at query time.
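> A minimal sketch of what the PTableStats interface from step 1 might look 
> like (the method name and guidepost representation are assumptions, not the 
> final API):
> {code}
> import org.apache.hadoop.hbase.HRegionInfo;
> import org.apache.hadoop.io.Writable;
> 
> // Sketch only: encapsulates per-table stats and is Writable so the server
> // can serialize it back to the client along with the PTable.
> public interface PTableStats extends Writable {
>     // Guideposts (row keys internal to the region) used to sub-divide the
>     // region's key range when parallelizing a scan; hypothetical signature.
>     byte[][] getRegionGuidePosts(HRegionInfo region);
> 
>     // Inherited from Writable:
>     //   void write(DataOutput out) throws IOException;
>     //   void readFields(DataInput in) throws IOException;
> }
> {code}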
> This should help boost query performance, especially in cases where the data 
> is highly skewed. It's likely the cause of the slowness reported in this 
> issue: https://github.com/forcedotcom/phoenix/issues/47.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)