[ 
https://issues.apache.org/jira/browse/PHOENIX-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312638#comment-15312638
 ] 

James Taylor commented on PHOENIX-2949:
---------------------------------------

Thanks for the fix, [[email protected]]. I think this code should be reworked a 
bit - it's an attempt to calculate how much data will be scanned when there's a 
limit and decide between running the query serially versus in parallel.
{code}
        GuidePostsInfo gpsInfo = 
table.getTableStats().getGuidePosts().get(SchemaUtil.getEmptyColumnFamily(table));
        long estRowSize = SchemaUtil.estimateRowSize(table);
        long estRegionSize;
        if (gpsInfo == null) {
            // Use guidepost depth as minimum size
            ConnectionQueryServices services = 
context.getConnection().getQueryServices();
            HTableDescriptor desc = 
services.getTableDescriptor(table.getPhysicalName().getBytes());
            int guidepostPerRegion = 
services.getProps().getInt(QueryServices.STATS_GUIDEPOST_PER_REGION_ATTRIB,
                    QueryServicesOptions.DEFAULT_STATS_GUIDEPOST_PER_REGION);
            long guidepostWidth = 
services.getProps().getLong(QueryServices.STATS_GUIDEPOST_WIDTH_BYTES_ATTRIB,
                    QueryServicesOptions.DEFAULT_STATS_GUIDEPOST_WIDTH_BYTES);
            estRegionSize = 
StatisticsUtil.getGuidePostDepth(guidepostPerRegion, guidepostWidth, desc);
        } else {
            // Region size estimated based on total number of bytes divided by 
number of regions
            long totByteSize = 0;
            for (long byteCount : gpsInfo.getByteCounts()) {
                totByteSize += byteCount;
            }
            estRegionSize = totByteSize / (gpsInfo.getGuidePostsCount()+1);
        }
        // TODO: configurable number of bytes?
        boolean isSerial = (perScanLimit * estRowSize < estRegionSize);
{code}
Here are some changes I think we should make to this code:
- I don't think it's worth doing an RPC here which I believe the 
services.getTableDescriptor() call will do and  your patch would do for the 
else branch.
- The estRowSize using SchemaUtil.estimateRowSize(table) should be in the if 
statement (where we don't have stats), and otherwise should calculate based on 
the totByteSize/totalNumberOfRows where totalNumberOfRows is calculated in the 
same loop that calculates totByteSize.
- How about we create a new config, {{phoenix.query.parallel.limit.threshold}} 
which is an absolute number of bytes above which we'd run limit in parallel? If 
not specified, we can fallback to {{HConstants.HREGION_MAX_FILESIZE}}. 
Something like this:
{code}
long limitThreshold = 
    services.getProps().getLong(QueryServices.QUERY_PARALLEL_LIMIT_THRESHOLD,
        services.getProps().getLong(HConstants.HREGION_MAX_FILESIZE, 
HConstants.DEFAULT_MAX_FILE_SIZE));
{code}
- Then our check would be:
{code}
        boolean isSerial = (perScanLimit * estRowSize < limitThreshold);
{code}

> Fix estimated region size when checking for serial query
> --------------------------------------------------------
>
>                 Key: PHOENIX-2949
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2949
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Ankit Singhal
>            Assignee: Ankit Singhal
>             Fix For: 4.8.0
>
>         Attachments: PHOENIX-2949.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to