[ https://issues.apache.org/jira/browse/DRILL-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326679#comment-15326679 ]
ASF GitHub Bot commented on DRILL-4530:
---------------------------------------

GitHub user amansinha100 opened a pull request:

    https://github.com/apache/drill/pull/519

    DRILL-4530: Optimize partition pruning with metadata caching for the …

    …single partition case.

     - Enhance PruneScanRule to detect single partitions based on referenced dirs in the filter.
     - Keep a new status of EXPANDED_PARTIAL for FileSelection.
     - Create separate .directories metadata file to prune directories first before files.
     - Introduce cacheFileRoot attribute to keep track of the parent directory of the cache file after partition pruning.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/amansinha100/incubator-drill DRILL-4530-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/519.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #519

----
commit 9c9687e804fa05c8f4b7b065738c458cb88bf5c4
Author: Aman Sinha <asi...@maprtech.com>
Date:   2016-03-25T19:55:59Z

    DRILL-4530: Optimize partition pruning with metadata caching for the single partition case.

     - Enhance PruneScanRule to detect single partitions based on referenced dirs in the filter.
     - Keep a new status of EXPANDED_PARTIAL for FileSelection.
     - Create separate .directories metadata file to prune directories first before files.
     - Introduce cacheFileRoot attribute to keep track of the parent directory of the cache file after partition pruning.

----

> Improve metadata cache performance for queries with single partition
> ---------------------------------------------------------------------
>
>                 Key: DRILL-4530
>                 URL: https://issues.apache.org/jira/browse/DRILL-4530
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>   Affects Versions: 1.6.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>             Fix For: 1.7.0
>
>
> Consider two types of queries which are run with Parquet metadata caching:
> {noformat}
> query 1:
> SELECT col FROM `A/B/C`;
> query 2:
> SELECT col FROM `A` WHERE dir0 = 'B' AND dir1 = 'C';
> {noformat}
> For a certain dataset, the query1 elapsed time is 1 sec whereas query2
> elapsed time is 9 sec even though both are accessing the same amount of data.
> The user expectation is that they should perform roughly the same. The main
> difference comes from reading the bigger metadata cache file at the root
> level 'A' for query2 and then applying the partitioning filter. query1 reads
> a much smaller metadata cache file at the subdirectory level.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
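
For illustration only, the idea of narrowing the metadata cache lookup from the root 'A' to the subdirectory 'A/B/C' can be sketched roughly as below. This is not Drill's code (the actual change is in the Java planner, in PruneScanRule and FileSelection); the function name cache_file_root and the dict-based representation of the dirN equality filters are assumptions made up for this example.

```python
from pathlib import PurePosixPath
from typing import Mapping


def cache_file_root(table_root: str, dir_equalities: Mapping[int, str]) -> PurePosixPath:
    """Return the deepest directory pinned down by dirN equality filters.

    dir_equalities maps a directory level N (as in dirN = 'value') to the
    value it is compared with. Both the name and this representation are
    hypothetical; they only model the idea described in the issue.
    """
    root = PurePosixPath(table_root)
    level = 0
    # Only a contiguous prefix dir0, dir1, ... identifies a single
    # subdirectory; stop at the first level that has no equality filter.
    while level in dir_equalities:
        root = root / dir_equalities[level]
        level += 1
    return root


# query 2 from the issue: FROM `A` WHERE dir0 = 'B' AND dir1 = 'C'
print(cache_file_root("A", {0: "B", 1: "C"}))
```

With both dir0 and dir1 fixed, the sketch resolves to the same directory that query 1 names directly, so a planner following this idea could read the smaller cache file there. When dir0 is missing from the filter, the loop never advances and the root 'A' is kept, which corresponds to falling back to the root-level cache file.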