[ 
https://issues.apache.org/jira/browse/DRILL-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352088#comment-15352088
 ] 

ASF GitHub Bot commented on DRILL-4530:
---------------------------------------

Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/519#discussion_r68678168
  
    --- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java 
---
    @@ -47,16 +47,25 @@
       private List<FileStatus> statuses;
     
       public List<String> files;
    +  /**
    +   * root path for the selections
    +   */
       public final String selectionRoot;
    +  /**
    +   * root path for the metadata cache file (if any)
    +   */
    +  public final String cacheFileRoot;
    --- End diff --
    
    That was my initial approach (updating the selectionRoot without keeping a 
separate cacheFileRoot). However, I ran into a few issues. The main one 
that I recall is that the dir0, dir1, etc. columns are associated with the 
selectionRoot. Suppose I run the following query: 
        SELECT dir0, dir1 FROM dfs.tmp.t2 WHERE dir0=2015 AND dir1='Q1' 
    If the selectionRoot is modified to point to '2015/Q1', we lose the 
context of the original dir0 and dir1, because everything becomes relative to 
the new selectionRoot. This produces wrong results. The same problem occurred 
with a SELECT * query, where the directory columns were not showing up 
correctly. 
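
    To make the problem concrete, here is a minimal, self-contained sketch. 
It is not Drill's actual code: the class DirColumnsSketch, the helper 
dirColumns, and the path /tmp/t2/2015/Q1/f.parquet are all hypothetical. It 
assumes only that the dirN values are the directory names found between the 
selectionRoot and the file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; Drill's real logic lives elsewhere.
class DirColumnsSketch {

  /**
   * Returns the directory names between selectionRoot and the file,
   * i.e. the values that would surface as dir0, dir1, ...
   */
  static List<String> dirColumns(String selectionRoot, String filePath) {
    String root = selectionRoot.endsWith("/") ? selectionRoot : selectionRoot + "/";
    if (!filePath.startsWith(root)) {
      throw new IllegalArgumentException(filePath + " is not under " + selectionRoot);
    }
    String[] parts = filePath.substring(root.length()).split("/");
    List<String> dirs = new ArrayList<>();
    // The last element is the file name, so it is not a directory column.
    for (int i = 0; i < parts.length - 1; i++) {
      dirs.add(parts[i]);
    }
    return dirs;
  }

  public static void main(String[] args) {
    String file = "/tmp/t2/2015/Q1/f.parquet"; // assumed example path

    // Original selectionRoot: both partition levels are visible.
    System.out.println(dirColumns("/tmp/t2", file));         // [2015, Q1]

    // selectionRoot moved down to the pruned directory: context is lost.
    System.out.println(dirColumns("/tmp/t2/2015/Q1", file)); // []
  }
}
```

    Under that assumption, moving the selectionRoot down to '2015/Q1' leaves 
no directory levels from which to populate dir0 and dir1, which is why a 
separate cacheFileRoot is kept alongside an unchanged selectionRoot.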


> Improve metadata cache performance for queries with single partition 
> ---------------------------------------------------------------------
>
>                 Key: DRILL-4530
>                 URL: https://issues.apache.org/jira/browse/DRILL-4530
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>    Affects Versions: 1.6.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>             Fix For: Future
>
>
> Consider two types of queries which are run with Parquet metadata caching: 
> {noformat}
> query 1:
> SELECT col FROM  `A/B/C`;
> query 2:
> SELECT col FROM `A` WHERE dir0 = 'B' AND dir1 = 'C';
> {noformat}
> For a certain dataset, the query1 elapsed time is 1 sec whereas query2 
> elapsed time is 9 sec even though both are accessing the same amount of data. 
>  The user expectation is that they should perform roughly the same.  The main 
> difference comes from reading the bigger metadata cache file at the root 
> level 'A' for query2 and then applying the partitioning filter.  query1 reads 
> a much smaller metadata cache file at the subdirectory level. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
