[ https://issues.apache.org/jira/browse/DRILL-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jinfeng Ni resolved DRILL-2400. ------------------------------- Resolution: Fixed Fixed in commit 6be9e3b and 6e6f789 > Drill's query planner did not merge multiple filters into one. > --------------------------------------------------------------- > > Key: DRILL-2400 > URL: https://issues.apache.org/jira/browse/DRILL-2400 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning & Optimization > Reporter: Jinfeng Ni > Assignee: Jinfeng Ni > Attachments: > 0005-DRILL-2240-Part-1-Change-cpu-cost-estimation-formula.patch, > 0006-DRILL-2240-Part-2-Add-merge-filter-rule-to-logical-r.patch > > > Let's say we have a view defined with a WHERE clause as the following: > {code} > "create or replace view MyViewWithFilter as " + > " SELECT first_name, " + > " last_name, " + > " full_name, " + > " salary, " + > " employee_id, " + > " store_id, " + > " position_id, " + > " position_title, " + > " education_level " + > " FROM cp.`employee.json` " + > " WHERE position_id in (1, 2, 3 ) " > {code} > For the following query which refers to the above view in two places, one in > out query, one in the subquery, each reference adds an additional filter, > {code} > select dat.store_id > , sum(dat.store_cost) as total_cost > from ( > select store_id, > position_id , > sum( salary) as store_cost > from MyViewWithFilter > where full_name in ( select n_name > from cp.`tpch/nation.parquet`) > and education_level = 'GRADUATE DEGREE' > and position_id in ( select position_id > from MyViewWithFilter > where position_title like '%VP%' > ) > group by store_id, position_id > ) dat > group by dat.store_id > order by dat.store_id > {code} > The current plan will be like the following : > {code} > 00-00 Screen > 00-01 Project(store_id=[$0], total_cost=[$1]) > 00-02 SelectionVectorRemover > 00-03 Sort(sort0=[$0], dir0=[ASC]) > 00-04 HashAgg(group=[{0}], total_cost=[SUM($1)]) > 00-05 Project(store_id=[$0], store_cost=[$2]) > 00-06 HashAgg(group=[{0, 1}], store_cost=[SUM($2)]) > 00-07 Project(store_id=[$1], position_id=[$2], $f3=[$0]) > 00-08 HashJoin(condition=[=($4, $5)], joinType=[inner]) > 00-10 Project($f3=[$0], $f5=[$1], $f6=[$2], $f8=[$3], > $f11=[$2]) > 00-12 HashJoin(condition=[=($4, $5)], joinType=[inner]) > 00-15 SelectionVectorRemover > 00-18 Filter(condition=[=($3, 'GRADUATE DEGREE')]) > 00-20 Project($f3=[CAST($2):ANY NOT NULL], > $f5=[CAST($3):ANY NOT NULL], $f6=[CAST($0):ANY NOT NULL], $f8=[CAST($4):ANY > NOT NULL], $f9=[CAST($1):ANY NOT NULL]) > 00-22 SelectionVectorRemover > 00-23 Filter(condition=[OR(=($0, 1), =($0, > 2), =($0, 3))]) > 00-24 Project(position_id=[$0], > full_name=[$4], salary=[$1], store_id=[$2], education_level=[$3]) > 00-25 Scan(groupscan=[EasyGroupScan > [selectionRoot=/employee.json, numFiles=1, columns=[`position_id`, > `full_name`, `salary`, `store_id`, `education_level`], > files=[/employee.json]]]) > 00-14 HashAgg(group=[{0}]) > 00-17 Scan(groupscan=[ParquetGroupScan > [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], > selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`n_name`]]]) > 00-09 HashAgg(group=[{0}]) > 00-11 Project(position_id=[CAST($0):ANY NOT NULL]) > 00-13 SelectionVectorRemover > 00-16 Filter(condition=[LIKE(CAST($1):ANY NOT > NULL, '%VP%')]) > 00-19 Filter(condition=[OR(=($0, 1), =($0, 2), > =($0, 3))]) > 00-21 Scan(groupscan=[EasyGroupScan > [selectionRoot=/employee.json, numFiles=1, columns=[`position_id`, > `position_title`], files=[/employee.json]]]) > {code} > In the plan, the View's own filter and the additional filter were not > merged, and we could see either in the out query or subquery, there are two > filters, like : > {code} > 00-16 Filter(condition=[LIKE(CAST($1):ANY NOT > NULL, '%VP%')]) > 00-19 Filter(condition=[OR(=($0, 1), =($0, 2), > =($0, 3))]) > {code} > It is desired to have filter merged, because > 1) it could improve the possibility of partition pruning with the merged > filter, since the partition pruning will match a filter directly on top of > SCAN. > Filter (dir0 = ' ....') > \ > Filter ( colA = '...') > \ > SCAN > If the filter with (dir0) is not merged with the second filter, then > partition pruning will not kick in > 2) Merged filter would allow short circuit evaluation, which could mean > better performance, compared with the case where we evaluate the filter one > by one. > There are two reasons why filters are not merged: > 1) Drill's planner does not enable the rule of MergerFilter rule > 2) Even if we enable MergeFilter rule, in some situation, we still could not > get filter merged, because the cpu cost estimation for filter is a bit off > the real cost. > In this JIRA, we'll > 1) Enable MergeFilter (Drill's version) > 2) Fix the cost estimation formula for Filter operator. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)