[ 
https://issues.apache.org/jira/browse/DRILL-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084309#comment-16084309
 ] 

Jinfeng Ni commented on DRILL-5468:
-----------------------------------

The difference between the plan prior to DRILL-4678 and the plan on latest 
master is that the join with the IN-clause subquery switched from a broadcast 
join (prior to DRILL-4678) to a distributed join (latest master). The new plan 
actually makes more sense, given how Drill's cost estimation works. Prior to 
DRILL-4678, it was only by luck that Drill ended up with a broadcast join plan. 

Here is the detailed analysis:

A simplified query to reproduce the problem:

{code}
select
  count(*)
from
  orders_hive_int32 o
where
  o.o_orderkey in (
    select
      l_orderkey
    from
      lineitem_hive_int32
    group by
      l_orderkey having
        sum(l_quantity) > 300
  )
{code} 

Notice that the IN-clause subquery has a GROUP BY on "l_orderkey", which is the 
same as the comparison key for the IN-clause. For an IN-clause subquery, 
Calcite puts an Aggregate over the comparison key to remove duplicates. In this 
case, the subquery's GROUP BY key is the same as the comparison key, so ideally 
the redundant Aggregate should be removed. Prior to DRILL-4678, Drill's planner 
failed to recognize this redundancy and kept the redundant aggregate. Here is 
the {{logical plan}} prior to DRILL-4678. Notice that there are two aggregates 
operating on the same key, with the top one simply doing duplicate removal (a 
hand-written SQL sketch of this join form follows the plan). 

{code}
DrillAggregateRel(group=[{}], EXPR$0=[COUNT()]): rowcount = 1.0, cumulative 
cost = {1.919693144E8 rows, 1.8386736168E9 cpu, 0.0 io, 0.0 network, 
1.29332996672E9 memory}, id = 175
  DrillProjectRel($f0=[0]): rowcount = 1.5E7, cumulative cost = {1.769693144E8 
rows, 1.6586736168E9 cpu, 0.0 io, 0.0 network, 1.16132996672E9 memory}, id = 174
    DrillJoinRel(condition=[=($0, $1)], joinType=[inner]): rowcount = 1.5E7, 
cumulative cost = {1.619693144E8 rows, 1.5986736168E9 cpu, 0.0 io, 0.0 network, 
1.16132996672E9 memory}, id = 173
      DrillScanRel(table=[[dfs, tpchsf10, orders]], groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=file:/drill/testdata/tpch-sf10/orders]], 
selectionRoot=file:/drill/testdata/tpch-sf10/orders, numFiles=1, 
usedMetadataFile=false, columns=[`o_orderkey`]]]): rowcount = 1.5E7, cumulative 
cost = {1.5E7 rows, 1.5E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 166
      DrillAggregateRel(group=[{0}]): rowcount = 2999302.6, cumulative cost = 
{1.289700118E8 rows, 1.379679196E9 cpu, 0.0 io, 0.0 network, 1.10854224096E9 
memory}, id = 172
        DrillProjectRel(l_orderkey=[$0]): rowcount = 2999302.6, cumulative cost 
= {1.259707092E8 rows, 1.3556847752E9 cpu, 0.0 io, 0.0 network, 1.0557545152E9 
memory}, id = 171
          DrillFilterRel(condition=[>($1, 300)]): rowcount = 2999302.6, 
cumulative cost = {1.259707092E8 rows, 1.3556847752E9 cpu, 0.0 io, 0.0 network, 
1.0557545152E9 memory}, id = 170
            DrillAggregateRel(group=[{0}], agg#0=[SUM($1)]): rowcount = 
5998605.2, cumulative cost = {1.19972104E8 rows, 1.319693144E9 cpu, 0.0 io, 0.0 
network, 1.0557545152E9 memory}, id = 169
              DrillScanRel(table=[[dfs, tpchsf10, lineitem]], 
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=file:/drill/testdata/tpch-sf10/lineitem]], 
selectionRoot=file:/drill/testdata/tpch-sf10/lineitem, numFiles=1, 
usedMetadataFile=false, columns=[`l_orderkey`, `l_quantity`]]]): rowcount = 
5.9986052E7, cumulative cost = {5.9986052E7 rows, 1.19972104E8 cpu, 0.0 io, 0.0 
network, 0.0 memory}, id = 155
{code}
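
For illustration, the shape of this old plan can be written by hand as the SQL 
sketch below (an illustration of the plan shape, not Drill's actual rewrite 
output). The inner GROUP BY is the subquery's own aggregate; the outer DISTINCT 
is the duplicate-removal Aggregate that Calcite adds for the IN-clause:

{code}
select
  count(*)
from
  orders_hive_int32 o
  inner join (
    select distinct     -- duplicate-removal aggregate added for the IN-clause
      l_orderkey
    from (
      select
        l_orderkey
      from
        lineitem_hive_int32
      group by
        l_orderkey having
          sum(l_quantity) > 300
    ) t
  ) dedup
  on o.o_orderkey = dedup.l_orderkey
{code}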

The redundant Aggregate would in theory incur additional cost. On the other 
hand, Drill's cost estimation reduces the rowCount to 10% for every aggregate 
operator (because Drill does not collect or maintain distinct row count 
statistics). This 10% rowCount reduction made the planner prefer a broadcast 
join instead of a distribution-based join. 
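
As a quick sanity check, the estimates in the plan above line up with that 
rule. The breakdown below is spelled out here for illustration; the 0.50 step 
matches Calcite's default selectivity for a range predicate:

{code}
scan(lineitem)                : 59,986,052 rows
agg(sum group by l_orderkey)  : 59,986,052 x 0.10 = 5,998,605   (10% per aggregate)
filter(sum(l_quantity) > 300) :  5,998,605 x 0.50 = 2,999,303   (0.5 selectivity for '>')
{code}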

With DRILL-4678, the new RelMetadataProvider code {{correctly}} detects that 
the two aggregate operators are applied to the same set of keys. As such, the 
second aggregate is removed. However, the planner then no longer takes the 
extra 10% reduction as before, and hence prefers a distribution-based join 
instead of a broadcast join.
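
In other words, on latest master the query effectively plans as the simplified 
form below (again a hand-written sketch): since l_orderkey is already unique 
coming out of the subquery's GROUP BY, the duplicate-removal aggregate is 
dropped:

{code}
select
  count(*)
from
  orders_hive_int32 o
  inner join (
    select
      l_orderkey        -- already distinct: it is the GROUP BY key
    from
      lineitem_hive_int32
    group by
      l_orderkey having
        sum(l_quantity) > 300
  ) q
  on o.o_orderkey = q.l_orderkey
{code}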

Here is the {{logical plan}} on latest master:
{code}
DrillAggregateRel(group=[{}], EXPR$0=[COUNT()]): rowcount = 1.0, cumulative 
cost = {1.889700118E8 rows, 1.814679196E9 cpu, 0.0 io, 0.0 network, 
1.24054224096E9 memory}, id = 279
  DrillProjectRel($f0=[0]): rowcount = 1.5E7, cumulative cost = {1.739700118E8 
rows, 1.634679196E9 cpu, 0.0 io, 0.0 network, 1.10854224096E9 memory}, id = 278
    DrillJoinRel(condition=[=($0, $1)], joinType=[inner]): rowcount = 1.5E7, 
cumulative cost = {1.589700118E8 rows, 1.574679196E9 cpu, 0.0 io, 0.0 network, 
1.10854224096E9 memory}, id = 277
      DrillScanRel(table=[[dfs, tpchsf10, orders]], groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=file:/drill/testdata/tpch-sf10/orders]], 
selectionRoot=file:/drill/testdata/tpch-sf10/orders, numFiles=1, 
usedMetadataFile=false, columns=[`o_orderkey`]]]): rowcount = 1.5E7, cumulative 
cost = {1.5E7 rows, 1.5E7 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 271
      DrillProjectRel(l_orderkey=[$0]): rowcount = 2999302.6, cumulative cost = 
{1.259707092E8 rows, 1.3556847752E9 cpu, 0.0 io, 0.0 network, 1.0557545152E9 
memory}, id = 276
        DrillFilterRel(condition=[>($1, 300)]): rowcount = 2999302.6, 
cumulative cost = {1.259707092E8 rows, 1.3556847752E9 cpu, 0.0 io, 0.0 network, 
1.0557545152E9 memory}, id = 275
          DrillAggregateRel(group=[{0}], agg#0=[SUM($1)]): rowcount = 
5998605.2, cumulative cost = {1.19972104E8 rows, 1.319693144E9 cpu, 0.0 io, 0.0 
network, 1.0557545152E9 memory}, id = 274
            DrillScanRel(table=[[dfs, tpchsf10, lineitem]], 
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
[path=file:/drill/testdata/tpch-sf10/lineitem]], 
selectionRoot=file:/drill/testdata/tpch-sf10/lineitem, numFiles=1, 
usedMetadataFile=false, columns=[`l_orderkey`, `l_quantity`]]]): rowcount = 
5.9986052E7, cumulative cost = {5.9986052E7 rows, 1.19972104E8 cpu, 0.0 io, 0.0 
network, 0.0 memory}, id = 259
{code} 

To summarize, the patch for DRILL-4678 seems to do the right thing in 
identifying the redundant aggregation. It is Drill's missing distinct row 
count statistics that causes the problem; prior to DRILL-4678, Drill was 
simply {{lucky}} to end up with the better plan. 

> TPCH Q18 regressed ~3x due to execution plan changes
> ----------------------------------------------------
>
>                 Key: DRILL-5468
>                 URL: https://issues.apache.org/jira/browse/DRILL-5468
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.11.0
>         Environment: 10+1 node ucs-micro cluster RHEL6.4
>            Reporter: Dechang Gu
>            Assignee: Jinfeng Ni
>             Fix For: 1.11.0
>
>         Attachments: Q18_profile_gitid_841ead4, Q18_profile_gitid_adbf363
>
>
> In a regular regression test on Drill master (commit id 841ead4), TPCH Q18 on 
> the SF100 parquet dataset took ~81 secs, while the same query on 1.10.0 took 
> only ~27 secs. The query time on commit adbf363, which is right before 
> 841ead4, is ~32 secs.
> Profiles show that the plans for the query changed quite a bit (profiles will 
> be uploaded).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
