Jinfeng Ni created DRILL-2400:
---------------------------------

             Summary: Drill's query planner did not merge multiple filters into 
one. 
                 Key: DRILL-2400
                 URL: https://issues.apache.org/jira/browse/DRILL-2400
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
            Reporter: Jinfeng Ni
            Assignee: Jinfeng Ni


Let's say we have a view defined with a WHERE clause as the following:

{code}
"create or replace view MyViewWithFilter as " +
        " SELECT  first_name, " +
        "         last_name, " +
        "         full_name, " +
        "         salary, " +
        "         employee_id, " +
        "         store_id, " +
        "         position_id, " +
        "         position_title, " +
        "         education_level " +
        " FROM cp.`employee.json` " +
        " WHERE position_id in (1, 2, 3 ) "
{code}

For the following query which refers to the above view in two places, one in 
out query, one in the subquery, each reference adds an additional filter,

{code}
select dat.store_id
      , sum(dat.store_cost) as total_cost
 from (
       select store_id,  
                  position_id , 
                  sum( salary) as store_cost
       from MyViewWithFilter 
       where full_name in ( select n_name
                                       from cp.`tpch/nation.parquet`)
           and  education_level = 'GRADUATE DEGREE'   
           and position_id in ( select position_id 
                                          from MyViewWithFilter
                                          where position_title like '%VP%'
                                        )
  group by store_id, position_id
) dat
group by dat.store_id
order by dat.store_id
{code}

The current plan will be like the following :
{code}
00-00    Screen
00-01      Project(store_id=[$0], total_cost=[$1])
00-02        SelectionVectorRemover
00-03          Sort(sort0=[$0], dir0=[ASC])
00-04            HashAgg(group=[{0}], total_cost=[SUM($1)])
00-05              Project(store_id=[$0], store_cost=[$2])
00-06                HashAgg(group=[{0, 1}], store_cost=[SUM($2)])
00-07                  Project(store_id=[$1], position_id=[$2], $f3=[$0])
00-08                    HashJoin(condition=[=($4, $5)], joinType=[inner])
00-10                      Project($f3=[$0], $f5=[$1], $f6=[$2], $f8=[$3], 
$f11=[$2])
00-12                        HashJoin(condition=[=($4, $5)], joinType=[inner])
00-15                          SelectionVectorRemover
00-18                            Filter(condition=[=($3, 'GRADUATE DEGREE')])
00-20                              Project($f3=[CAST($2):ANY NOT NULL], 
$f5=[CAST($3):ANY NOT NULL], $f6=[CAST($0):ANY NOT NULL], $f8=[CAST($4):ANY NOT 
NULL], $f9=[CAST($1):ANY NOT NULL])
00-22                                SelectionVectorRemover
00-23                                  Filter(condition=[OR(=($0, 1), =($0, 2), 
=($0, 3))])
00-24                                    Project(position_id=[$0], 
full_name=[$4], salary=[$1], store_id=[$2], education_level=[$3])
00-25                                      Scan(groupscan=[EasyGroupScan 
[selectionRoot=/employee.json, numFiles=1, columns=[`position_id`, `full_name`, 
`salary`, `store_id`, `education_level`], files=[/employee.json]]])
00-14                          HashAgg(group=[{0}])
00-17                            Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], 
selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`n_name`]]])
00-09                      HashAgg(group=[{0}])
00-11                        Project(position_id=[CAST($0):ANY NOT NULL])
00-13                          SelectionVectorRemover
00-16                            Filter(condition=[LIKE(CAST($1):ANY NOT NULL, 
'%VP%')])
00-19                              Filter(condition=[OR(=($0, 1), =($0, 2), 
=($0, 3))])
00-21                                Scan(groupscan=[EasyGroupScan 
[selectionRoot=/employee.json, numFiles=1, columns=[`position_id`, 
`position_title`], files=[/employee.json]]])
{code}

In the plan,  the View's own filter and the additional filter were not merged, 
and we could see either in the out query or subquery, there are two filters, 
like :
{code}
00-16                            Filter(condition=[LIKE(CAST($1):ANY NOT NULL, 
'%VP%')])
00-19                              Filter(condition=[OR(=($0, 1), =($0, 2), 
=($0, 3))])
{code}

It is desired to have filter merged, because 
1) it could improve the possibility of partition pruning with the merged 
filter, since the partition pruning will match a filter directly on top of SCAN.
   Filter   (dir0 = ' ....') 
      \     
   Filter   ( colA = '...')
          \
       SCAN
If the filter with (dir0) is not merged with the second filter, then partition 
pruning will not kick in

2) Merged filter would allow short circuit evaluation, which could mean better 
performance, compared with the case where we evaluate the filter one by one.

There are two reasons why filters are not merged:
1) Drill's planner does not enable the rule of MergerFilter rule
2) Even if we enable MergeFilter rule, in some situation, we still could not 
get filter merged, because the cpu cost estimation for filter is a bit off the 
real cost.

In this JIRA, we'll 

1) Enable MergeFilter (Drill's version)
2) Fix the cost estimation formula for Filter operator.

 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to