Dechang Gu created DRILL-4827:
---------------------------------
Summary: Checking modification time of directories takes too long,
needs to be improved
Key: DRILL-4827
URL: https://issues.apache.org/jira/browse/DRILL-4827
Project: Apache Drill
Issue Type: Improvement
Components: Functions - Drill
Affects Versions: 1.8.0
Environment: RHEL 6
Reporter: Dechang Gu
This is a tracking bug for metadata cache performance when checking directories.
When evaluating the fix for DRILL-4530, we ran the following two queries on 50K
Parquet files in a 3-layer directory hierarchy:
Query1: explain plan for select * from
dfs.`/tpchMetaParquet/tpch100_dir_partitioned_50000files/lineitem` where
dir0=2006 and dir1=12 and dir2=15;
Query2: explain plan for select * from
dfs.`/tpchMetaParquet/tpch100_dir_partitioned_50000files/lineitem/2006/12/15`;
Query1 takes 3.254 secs; Query2 takes 0.505 secs.
Drillbit.log shows that for Query1, about 2.5 secs were spent after the metadata
cache was read and before partition pruning began:
2016-08-02 15:43:43,051 ucs-node7.perf.lab
[285edddf-b1f3-cd74-e826-84cb91ebc6e1:foreman] INFO
o.a.drill.exec.work.foreman.Foreman - Query text for query id
285edddf-b1f3-cd74-e826-84cb91ebc6e1: explain plan for select * from
dfs.`/tpchMetaParquet/tpch100_dir_partitioned_50000files/lineitem` where
dir0=2006 and dir1=12 and dir2=15
2016-08-02 15:43:43,193 ucs-node7.perf.lab
[285edddf-b1f3-cd74-e826-84cb91ebc6e1:foreman] INFO
o.a.d.exec.store.parquet.Metadata - Took 6 ms to read directories from
directory cache file
2016-08-02 15:43:45,745 ucs-node7.perf.lab
[285edddf-b1f3-cd74-e826-84cb91ebc6e1:foreman] INFO
o.a.d.e.p.l.partition.PruneScanRule - Beginning partition pruning, pruning
class:
org.apache.drill.exec.planner.logical.partition.PruneScanRule$DirPruneScanFilterOnScanRule
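The gap can be read directly off the last two log timestamps above. A minimal sketch of that arithmetic, assuming the timestamps use the usual comma-separated milliseconds layout (the variable names are only illustrative):

```python
from datetime import datetime

# Timestamp layout as it appears in the Drillbit.log excerpt above.
FMT = "%Y-%m-%d %H:%M:%S,%f"

# "Took 6 ms to read directories from directory cache file"
cache_read = datetime.strptime("2016-08-02 15:43:43,193", FMT)
# "Beginning partition pruning"
pruning_start = datetime.strptime("2016-08-02 15:43:45,745", FMT)

# Time spent between reading the cache file and starting to prune.
gap_secs = (pruning_start - cache_read).total_seconds()
print(f"{gap_secs:.3f} secs")
```

This gap accounts for most of the 3.254 secs that Query1 takes.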
Further investigation shows that the 2.5 secs went to checking the modification
time of directories, a cost that is proportional to the number of directories
checked.
It looks like this can be improved by checking only the top-level directory.
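To illustrate why the cost scales with the directory count, here is a rough sketch of the two strategies on a local file system. It is not Drill's code: the function names are hypothetical, and `os.stat` stands in for whatever file-system call Drill actually makes.

```python
import os
import tempfile


def is_stale_full_walk(root, cache_mtime):
    """Stat every directory under root; cost grows with the directory count."""
    stats = 0
    stale = False
    for dirpath, _dirs, _files in os.walk(root):
        stats += 1
        if os.stat(dirpath).st_mtime > cache_mtime:
            stale = True
    return stale, stats


def is_stale_top_only(root, cache_mtime):
    """Stat only the top-level directory; always a single call."""
    return os.stat(root).st_mtime > cache_mtime, 1


# Tiny stand-in for the year/month/day layout used in the queries above.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "2006", "12", "15"))
    cache_mtime = os.stat(root).st_mtime + 60  # pretend the cache is newer
    full_result = is_stale_full_walk(root, cache_mtime)
    top_result = is_stale_top_only(root, cache_mtime)

print(full_result, top_result)
```

Each function returns a (stale, number of stat calls) pair. The full walk stats every directory in the tree, while the top-only check makes one call regardless of depth. One caveat: on most file systems a directory's modification time changes only when entries are added or removed directly inside it, so a top-only check would need some other way to notice changes deeper in the tree. This report does not say how that would be handled.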