[ 
https://issues.apache.org/jira/browse/DRILL-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dechang Gu reassigned DRILL-4827:
---------------------------------

    Assignee: Aman Sinha

> Checking modification time of directories takes too long, needs to be improved
> ------------------------------------------------------------------------------
>
>                 Key: DRILL-4827
>                 URL: https://issues.apache.org/jira/browse/DRILL-4827
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Functions - Drill
>    Affects Versions: 1.8.0
>         Environment: RHEL 6
>            Reporter: Dechang Gu
>            Assignee: Aman Sinha
>
> This is a tracking bug for metadata cache performance when checking directories.
> When evaluating the fix for DRILL-4530, we ran the following two queries on 
> 50K parquet files in a 3-layer directory hierarchy:
> Query1: explain plan for select * from 
> dfs.`/tpchMetaParquet/tpch100_dir_partitioned_50000files/lineitem` where 
> dir0=2006 and dir1=12 and dir2=15;
> Query2:  explain plan for select * from 
> dfs.`/tpchMetaParquet/tpch100_dir_partitioned_50000files/lineitem/2006/12/15`;
> Query1 takes 3.254 secs; Query2 takes 0.505 secs.
> Drillbit.log shows that for Query1, about 2.5 secs were spent after the 
> metadata cache was read and before partition pruning began:
> 2016-08-02 15:43:43,051 ucs-node7.perf.lab 
> [285edddf-b1f3-cd74-e826-84cb91ebc6e1:foreman] INFO  
> o.a.drill.exec.work.foreman.Foreman - Query text for query id 
> 285edddf-b1f3-cd74-e826-84cb91ebc6e1: explain plan for select * from 
> dfs.`/tpchMetaParquet/tpch100_dir_partitioned_50000files/lineitem` where 
> dir0=2006 and dir1=12 and dir2=15
> 2016-08-02 15:43:43,193 ucs-node7.perf.lab 
> [285edddf-b1f3-cd74-e826-84cb91ebc6e1:foreman] INFO  
> o.a.d.exec.store.parquet.Metadata - Took 6 ms to read directories from 
> directory cache file
> 2016-08-02 15:43:45,745 ucs-node7.perf.lab 
> [285edddf-b1f3-cd74-e826-84cb91ebc6e1:foreman] INFO  
> o.a.d.e.p.l.partition.PruneScanRule - Beginning partition pruning, pruning 
> class: 
> org.apache.drill.exec.planner.logical.partition.PruneScanRule$DirPruneScanFilterOnScanRule
> Further investigation shows that the 2.5 secs were spent checking the 
> modification times of directories, a cost that is proportional to the number 
> of directories checked.
> It looks like this can be improved by checking only the top-level directory.
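
To make the cost difference concrete, here is a minimal sketch in Python. It is not Drill's implementation. The function names and the `cache_mtime` parameter are hypothetical, and it assumes a local filesystem where `os.stat` stands in for the file system calls Drill would make. It contrasts a staleness check that stats every directory with one that stats only the top-level directory.

```python
import os


def make_tree(root, years, months, days):
    """Create a 3-layer year/month/day directory hierarchy under root (illustrative only)."""
    for y in years:
        for m in months:
            for d in days:
                os.makedirs(os.path.join(root, str(y), str(m), str(d)), exist_ok=True)


def is_stale_full_scan(root, cache_mtime):
    """Stat every directory under root, including root itself.

    Returns (stale, dirs_checked). The number of stat calls grows
    linearly with the number of directories in the hierarchy.
    """
    checked = 0
    stale = False
    for dirpath, _dirnames, _filenames in os.walk(root):
        checked += 1
        if os.stat(dirpath).st_mtime > cache_mtime:
            stale = True
    return stale, checked


def is_stale_top_level(root, cache_mtime):
    """Stat only the top-level directory: one call regardless of tree size.

    Assumption: whatever modifies the table also updates the root
    directory's modification time. On POSIX file systems a directory's
    mtime changes only when its direct entries change, so edits deep in
    the tree do not propagate to the root on their own.
    """
    return os.stat(root).st_mtime > cache_mtime, 1
```

For a tree built with `make_tree(root, [2005, 2006], [11, 12], [14, 15])`, the full scan stats 15 directories (1 + 2 + 4 + 8) while the top-level check stats one. Whether the single check is safe depends on how the metadata cache is refreshed when nested directories change, which this sketch does not address.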



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
