[
https://issues.apache.org/jira/browse/DRILL-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pritesh Maker reassigned DRILL-4827:
------------------------------------
Assignee: Venkata Jyothsna Donapati (was: Aman Sinha)
> Checking modification time of directories takes too long, needs to be improved
> ------------------------------------------------------------------------------
>
> Key: DRILL-4827
> URL: https://issues.apache.org/jira/browse/DRILL-4827
> Project: Apache Drill
> Issue Type: Improvement
> Components: Functions - Drill
> Affects Versions: 1.8.0
> Environment: RHEL 6
> Reporter: Dechang Gu
> Assignee: Venkata Jyothsna Donapati
> Priority: Major
>
> This is a tracking bug for metadata cache performance when checking
> directories. When evaluating the fix for DRILL-4530, we ran the following two
> queries on 50K parquet files in a 3-layer directory hierarchy:
> Query1: explain plan for select * from
> dfs.`/tpchMetaParquet/tpch100_dir_partitioned_50000files/lineitem` where
> dir0=2006 and dir1=12 and dir2=15;
> Query2: explain plan for select * from
> dfs.`/tpchMetaParquet/tpch100_dir_partitioned_50000files/lineitem/2006/12/15`;
> Query1 takes 3.254 secs; Query2 takes 0.505 secs.
> Drillbit.log shows that for Query1, 2.5 secs were spent after the metadata
> cache was read and before partition pruning began:
> 2016-08-02 15:43:43,051 ucs-node7.perf.lab
> [285edddf-b1f3-cd74-e826-84cb91ebc6e1:foreman] INFO
> o.a.drill.exec.work.foreman.Foreman - Query text for query id
> 285edddf-b1f3-cd74-e826-84cb91ebc6e1: explain plan for select * from
> dfs.`/tpchMetaParquet/tpch100_dir_partitioned_50000files/lineitem` where
> dir0=2006 and dir1=12 and dir2=15
> 2016-08-02 15:43:43,193 ucs-node7.perf.lab
> [285edddf-b1f3-cd74-e826-84cb91ebc6e1:foreman] INFO
> o.a.d.exec.store.parquet.Metadata - Took 6 ms to read directories from
> directory cache file
> 2016-08-02 15:43:45,745 ucs-node7.perf.lab
> [285edddf-b1f3-cd74-e826-84cb91ebc6e1:foreman] INFO
> o.a.d.e.p.l.partition.PruneScanRule - Beginning partition pruning, pruning
> class:
> org.apache.drill.exec.planner.logical.partition.PruneScanRule$DirPruneScanFilterOnScanRule
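The 2.5 secs figure can be cross-checked from the two log timestamps quoted above. The following is a minimal sketch of that arithmetic, not part of Drill; the helper name `gap_seconds` and the constant names are made up for illustration, and the timestamp format is assumed from the quoted log lines.

```python
from datetime import datetime

# Timestamps copied from the two log lines quoted above.
CACHE_READ = "2016-08-02 15:43:43,193"     # "Took 6 ms to read directories ..."
PRUNING_START = "2016-08-02 15:43:45,745"  # "Beginning partition pruning ..."

# Assumed layout of the drillbit.log timestamp: date, time, comma, milliseconds.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def gap_seconds(earlier: str, later: str) -> float:
    """Return the elapsed seconds between two log timestamps."""
    start = datetime.strptime(earlier, LOG_TS_FORMAT)
    end = datetime.strptime(later, LOG_TS_FORMAT)
    return (end - start).total_seconds()


# Time between reading the directory cache file and the start of pruning.
print(f"{gap_seconds(CACHE_READ, PRUNING_START):.3f} secs")
```

The result accounts for most of the difference between the two query times reported above (3.254 secs versus 0.505 secs).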
> Further investigation shows that the 2.5 secs were spent checking the
> modification time of directories, a cost proportional to the number of
> directories checked.
> It looks like this can be improved by checking only the top-level directory.
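To illustrate why the cost grows with the number of directories, here is a rough sketch that compares the two strategies on a local filesystem. It is not Drill's implementation: the function names `make_partition_tree`, `check_every_directory`, and `check_top_level_only` are hypothetical, and the cache timestamp is passed in as a plain number rather than read from a real metadata cache file.

```python
import os
import tempfile
from typing import Tuple


def make_partition_tree(root: str, years: int, months: int, days: int) -> None:
    """Create a 3-layer root/<year>/<month>/<day> directory hierarchy."""
    for y in range(years):
        for m in range(months):
            for d in range(days):
                os.makedirs(
                    os.path.join(root, str(2000 + y), f"{m + 1:02d}", f"{d + 1:02d}")
                )


def check_every_directory(root: str, cache_mtime: float) -> Tuple[bool, int]:
    """Stat every directory under root.

    Returns (cache_is_fresh, number_of_directories_checked).
    """
    fresh = True
    checked = 0
    for dirpath, _dirnames, _filenames in os.walk(root):
        checked += 1
        if os.stat(dirpath).st_mtime > cache_mtime:
            fresh = False  # this directory changed after the cache was written
    return fresh, checked


def check_top_level_only(root: str, cache_mtime: float) -> Tuple[bool, int]:
    """Stat only the top-level directory: one check regardless of tree size."""
    return os.stat(root).st_mtime <= cache_mtime, 1


with tempfile.TemporaryDirectory() as tmp:
    make_partition_tree(tmp, years=2, months=3, days=4)
    newer_than_everything = float("inf")  # pretend the cache is up to date
    print("full walk:     ", check_every_directory(tmp, newer_than_everything))
    print("top level only:", check_top_level_only(tmp, newer_than_everything))
```

One caveat with the cheaper check: on POSIX filesystems a directory's modification time changes only when its direct entries change, so a top-level-only check will not by itself notice files added deep in the hierarchy. Any real fix would need some other way to account for that.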
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)