[ 
https://issues.apache.org/jira/browse/DRILL-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326679#comment-15326679
 ] 

ASF GitHub Bot commented on DRILL-4530:
---------------------------------------

GitHub user amansinha100 opened a pull request:

    https://github.com/apache/drill/pull/519

    DRILL-4530: Optimize partition pruning with metadata caching for the single partition case.
    
     - Enhance PruneScanRule to detect single partitions based on referenced dirs in the filter.
     - Keep a new status of EXPANDED_PARTIAL for FileSelection.
     - Create separate .directories metadata file to prune directories first before files.
     - Introduce cacheFileRoot attribute to keep track of the parent directory of the cache file after partition pruning.
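
As a rough illustration of the single-partition idea in the list above, the sketch below maps equality filters on dir0, dir1, ... to the deepest subdirectory they pin down, which is where a smaller metadata cache file could then be read. This is not Drill's actual PruneScanRule code: the class CacheRootSketch, the method resolveCacheFileRoot, and the map-based filter representation are all hypothetical names chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not taken from the Drill code base.
public class CacheRootSketch {

    /**
     * Follows consecutive dir0, dir1, ... equality values starting at root and
     * returns the deepest directory they identify. The walk stops at the first
     * level that has no equality value, so a gap (e.g. only dir1 given) leaves
     * the result at the last directory that could be resolved.
     */
    static String resolveCacheFileRoot(String root, Map<String, String> dirEquals) {
        StringBuilder path = new StringBuilder(root);
        for (int level = 0; ; level++) {
            String value = dirEquals.get("dir" + level);
            if (value == null) {
                break; // no filter on this level: cannot descend further
            }
            path.append('/').append(value);
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // Models: SELECT col FROM `A` WHERE dir0 = 'B' AND dir1 = 'C'
        Map<String, String> filter = new LinkedHashMap<>();
        filter.put("dir0", "B");
        filter.put("dir1", "C");
        System.out.println(resolveCacheFileRoot("A", filter)); // prints A/B/C
    }
}
```

Under these assumptions, a filter that fixes every leading directory level resolves to the same location that a direct query on the subdirectory would name.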

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/amansinha100/incubator-drill DRILL-4530-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/519.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #519
    
----
commit 9c9687e804fa05c8f4b7b065738c458cb88bf5c4
Author: Aman Sinha <asi...@maprtech.com>
Date:   2016-03-25T19:55:59Z

    DRILL-4530: Optimize partition pruning with metadata caching for the single partition case.
    
     - Enhance PruneScanRule to detect single partitions based on referenced dirs in the filter.
     - Keep a new status of EXPANDED_PARTIAL for FileSelection.
     - Create separate .directories metadata file to prune directories first before files.
     - Introduce cacheFileRoot attribute to keep track of the parent directory of the cache file after partition pruning.

----


> Improve metadata cache performance for queries with single partition 
> ---------------------------------------------------------------------
>
>                 Key: DRILL-4530
>                 URL: https://issues.apache.org/jira/browse/DRILL-4530
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>    Affects Versions: 1.6.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>             Fix For: 1.7.0
>
>
> Consider two types of queries which are run with Parquet metadata caching: 
> {noformat}
> query 1:
> SELECT col FROM  `A/B/C`;
> query 2:
> SELECT col FROM `A` WHERE dir0 = 'B' AND dir1 = 'C';
> {noformat}
> For a certain dataset, the elapsed time of query 1 is 1 sec whereas that of
> query 2 is 9 sec, even though both access the same amount of data. Users
> expect the two to perform roughly the same. The main difference comes from
> query 2 reading the larger metadata cache file at the root level 'A' and
> then applying the partitioning filter, whereas query 1 reads a much smaller
> metadata cache file at the subdirectory level.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
