[
https://issues.apache.org/jira/browse/PHOENIX-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312638#comment-15312638
]
James Taylor commented on PHOENIX-2949:
---------------------------------------
Thanks for the fix, [[email protected]]. I think this code should be reworked a
bit - it's an attempt to calculate how much data will be scanned when there's a
limit and decide between running the query serially versus in parallel.
{code}
GuidePostsInfo gpsInfo =
table.getTableStats().getGuidePosts().get(SchemaUtil.getEmptyColumnFamily(table));
long estRowSize = SchemaUtil.estimateRowSize(table);
long estRegionSize;
if (gpsInfo == null) {
// Use guidepost depth as minimum size
ConnectionQueryServices services =
context.getConnection().getQueryServices();
HTableDescriptor desc =
services.getTableDescriptor(table.getPhysicalName().getBytes());
int guidepostPerRegion =
services.getProps().getInt(QueryServices.STATS_GUIDEPOST_PER_REGION_ATTRIB,
QueryServicesOptions.DEFAULT_STATS_GUIDEPOST_PER_REGION);
long guidepostWidth =
services.getProps().getLong(QueryServices.STATS_GUIDEPOST_WIDTH_BYTES_ATTRIB,
QueryServicesOptions.DEFAULT_STATS_GUIDEPOST_WIDTH_BYTES);
estRegionSize =
StatisticsUtil.getGuidePostDepth(guidepostPerRegion, guidepostWidth, desc);
} else {
// Region size estimated based on total number of bytes divided by
number of regions
long totByteSize = 0;
for (long byteCount : gpsInfo.getByteCounts()) {
totByteSize += byteCount;
}
estRegionSize = totByteSize / (gpsInfo.getGuidePostsCount()+1);
}
// TODO: configurable number of bytes?
boolean isSerial = (perScanLimit * estRowSize < estRegionSize);
{code}
Here are some changes I think we should make to this code:
- I don't think it's worth doing an RPC here which I believe the
services.getTableDescriptor() call will do and your patch would do for the
else branch.
- The estRowSize using SchemaUtil.estimateRowSize(table) should be in the if
statement (where we don't have stats), and otherwise should calculate based on
the totByteSize/totalNumberOfRows where totalNumberOfRows is calculated in the
same loop that calculates totByteSize.
- How about we create a new config, {{phoenix.query.parallel.limit.threshold}}
which is an absolute number of bytes above which we'd run limit in parallel? If
not specified, we can fallback to {{HConstants.HREGION_MAX_FILESIZE}}.
Something like this:
{code}
long limitThreshold =
services.getProps().getLong(QueryServices.QUERY_PARALLEL_LIMIT_THRESHOLD,
services.getProps().getLong(HConstants.HREGION_MAX_FILESIZE,
HConstants.DEFAULT_MAX_FILE_SIZE));
{code}
- Then our check would be:
{code}
boolean isSerial = (perScanLimit * estRowSize < limitThreshold);
{code}
> Fix estimated region size when checking for serial query
> --------------------------------------------------------
>
> Key: PHOENIX-2949
> URL: https://issues.apache.org/jira/browse/PHOENIX-2949
> Project: Phoenix
> Issue Type: Bug
> Reporter: Ankit Singhal
> Assignee: Ankit Singhal
> Fix For: 4.8.0
>
> Attachments: PHOENIX-2949.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)