[ https://issues.apache.org/jira/browse/DRILL-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jinfeng Ni updated DRILL-2157: ------------------------------ Fix Version/s: 1.0.0 Assignee: Aman Sinha (was: Jinfeng Ni) > Directory pruning on subdirectories only and data type conversions for > directory filters > ---------------------------------------------------------------------------------------- > > Key: DRILL-2157 > URL: https://issues.apache.org/jira/browse/DRILL-2157 > Project: Apache Drill > Issue Type: Improvement > Components: Query Planning & Optimization > Affects Versions: Future > Reporter: Andries Engelbrecht > Assignee: Aman Sinha > Priority: Minor > Fix For: 1.0.0 > > > Drill will scan all files and directories when using only a subdirectory as a > predicate. Additionally if the data type for the directory filter is not a > string and is converted Drill will also first scan all the subdirectories adn > files before applying the filter. > My current observation is that for a directory structure as listed below, > the pruning only works if the full tree is provided. If only a lower level > directory is supplied in the filter condition Drill only uses it as a > filter. > With directory structure as below > /2015 > /01 > /10 > /11 > /12 > /13 > /14 > Query: > select count(id) from `/foo` t where dir0='2015' and dir1='01' and > dir2='10' > Produces the correct pruning and query plan > 01-02 Project(id=[$3]): rowcount = 3670316.0, cumulative cost = > {1.1010948E7 rows, 1.4681284E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = > 28434 > 01-03 Project(dir0=[$0], dir1=[$3], dir2=[$2], id=[$1]): > rowcount = 3670316.0, cumulative cost = {7340632.0 rows, 1.468128E7 cpu, > 0.0 io, 0.0 network, 0.0 memory}, id = 28433 > 01-04 Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, > numFiles=24, columns=[`dir0`, `dir1`, `dir2`, `id`] > However: > select count(id) from `/foo` t where dir2='10' > Produces full scan of all sub directories and only applies a filter > condition after the fact. Notice the numFiles between the 2, even though it > lists columns in the base scan > 01-04 Filter(condition=[=($0, '10')]): rowcount = > 9423761.7, cumulative cost = {1.88475234E8 rows, 3.76950476E8 cpu, 0.0 io, > 0.0 network, 0.0 memory}, id = 27470 > 01-05 Project(dir2=[$1], id=[$0]): rowcount = > 6.2825078E7, cumulative cost = {1.25650156E8 rows, 1.25650164E8 cpu, 0.0 > io, 0.0 network, 0.0 memory}, id = 27469 > 01-06 Scan(groupscan=[EasyGroupScan > [selectionRoot=/foo, numFiles=405, columns=[`dir2`, `id`] > Also using the wrong data type for the filter produces a full scan > select count(id) from `/foo` where dir_year=2015 and dir_month=01 and > dir_day=14 > Produces > 01-04 Filter(condition=[AND(=(CAST($1):ANY NOT NULL, 2015), > =(CAST($2):ANY NOT NULL, 1), =(CAST($3):ANY NOT NULL, 10))]): rowcount = > 212034.63825, cumulative cost = {1.88475234E8 rows, 1.005201264E9 cpu, 0.0 > io, 0.0 network, 0.0 memory}, id = 34910 > 01-05 Project(id=[$2], dir0=[$3], dir1=[$1], dir2=[$0]): > rowcount = 6.2825078E7, cumulative cost = {1.25650156E8 rows, 2.51300328E8 > cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 34909 > 01-06 Scan(groupscan=[EasyGroupScan [selectionRoot=/foo, > numFiles=405, columns=[`id`, `dir0`, `dir1`, `dir2`], -- This message was sent by Atlassian JIRA (v6.3.4#6332)