[
https://issues.apache.org/jira/browse/DRILL-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337974#comment-15337974
]
ASF GitHub Bot commented on DRILL-4530:
---------------------------------------
Github user amansinha100 commented on a diff in the pull request:
https://github.com/apache/drill/pull/519#discussion_r67602350
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java ---
@@ -269,13 +283,54 @@ protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectR
int recordCount = 0;
int qualifiedCount = 0;
- // Inner loop: within each batch iterate over the PartitionLocations
- for(PartitionLocation part: partitions){
- if(!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1){
- newPartitions.add(part);
- qualifiedCount++;
+ if (checkForSingle &&
+ partitions.get(0).isCompositePartition() /* apply single partition check only for composite partitions */) {
+ // Inner loop: within each batch iterate over the PartitionLocations
+ for (PartitionLocation part : partitions) {
+ assert part.isCompositePartition();
+ if(!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1) {
+ newPartitions.add(part);
+ if (isSinglePartition) { // only need to do this if we are already single partition
+ // compose the array of partition values for the directories that are referenced by filter:
+ // e.g suppose the dir hierarchy is year/quarter/month and the query is:
+ // SELECT * FROM T WHERE dir0=2015 AND dir1 = 'Q1',
+ // then for 2015/Q1/Feb, this will have ['2015', 'Q1', null]
+ // Note that we are not using the PartitionLocation here but composing a different list because
+ // we are only interested in the directory columns that are referenced in the filter condition. not
+ // the SELECT list or other parts of the query.
+ Pair<String[], Integer> p = composePartition(referencedDirsBitSet, partitionMap, vectors, recordCount);
+ String[] parts = p.getLeft();
+ int tmpIndex = p.getRight();
+ if (spInfo == null) {
+ spInfo = parts;
+ maxIndex = tmpIndex;
+ } else if (maxIndex != tmpIndex) {
+ isSinglePartition = false;
+ break;
+ } else {
+ // we only want to compare until the maxIndex inclusive since subsequent values would be null
+ for (int j = 0; j <= maxIndex; j++) {
+ if (spInfo[j] == null // prefixes should be non-null
--- End diff --
If the query has, for example, WHERE dir2 = 'January' and no condition on
dir0 or dir1, then the spInfo array itself will be non-null but will contain
null elements: [null, null, 'January'], with maxIndex = 2. In this case, the
single partition optimization should not be applied, because the leading
directory levels are not pinned down by the filter. I can add a comment here
to explain.
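To make the case above concrete, here is a minimal, self-contained sketch of the prefix check. It is not the code in PruneScanRule: the class name SinglePartitionPrefixCheck and the helper hasNonNullPrefix are hypothetical, and only the spInfo and maxIndex names are borrowed from the diff. It assumes maxIndex is a valid index into spInfo.

```java
// Illustrative sketch only; names other than spInfo/maxIndex are hypothetical
// and do not exist in Drill.
class SinglePartitionPrefixCheck {

    // Returns true only if every directory level from dir0 through
    // dir<maxIndex> has a value, i.e. the filter pins down a full prefix.
    // Assumes 0 <= maxIndex < spInfo.length.
    static boolean hasNonNullPrefix(String[] spInfo, int maxIndex) {
        for (int j = 0; j <= maxIndex; j++) {
            if (spInfo[j] == null) {
                return false; // a gap in the prefix: not a single partition
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // WHERE dir0 = 2015 AND dir1 = 'Q1' -> prefix is complete
        String[] prefixed = {"2015", "Q1", null};
        System.out.println(hasNonNullPrefix(prefixed, 1)); // true

        // WHERE dir2 = 'January' only -> dir0 and dir1 are null
        String[] gap = {null, null, "January"};
        System.out.println(hasNonNullPrefix(gap, 2)); // false
    }
}
```

With a filter on dir2 alone, the array is non-null but the check fails at j = 0, so the single partition path would be skipped.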
> Improve metadata cache performance for queries with single partition
> ---------------------------------------------------------------------
>
> Key: DRILL-4530
> URL: https://issues.apache.org/jira/browse/DRILL-4530
> Project: Apache Drill
> Issue Type: Improvement
> Components: Query Planning & Optimization
> Affects Versions: 1.6.0
> Reporter: Aman Sinha
> Assignee: Aman Sinha
> Fix For: 1.7.0
>
>
> Consider two types of queries which are run with Parquet metadata caching:
> {noformat}
> query 1:
> SELECT col FROM `A/B/C`;
> query 2:
> SELECT col FROM `A` WHERE dir0 = 'B' AND dir1 = 'C';
> {noformat}
> For a certain dataset, query 1 takes 1 sec whereas query 2 takes 9 sec,
> even though both access the same amount of data. The user expectation is
> that they should perform roughly the same. The main difference comes from
> query 2 reading the bigger metadata cache file at the root level 'A' and
> then applying the partitioning filter, whereas query 1 reads a much smaller
> metadata cache file at the subdirectory level.