[ 
https://issues.apache.org/jira/browse/DRILL-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109938#comment-14109938
 ] 

Jinfeng Ni commented on DRILL-1337:
-----------------------------------

Similar issue will happen when the join condition contains a local LHS filter. 
For the following query, it is supposedly return 25. However, Drill will return 
14, since it evaluate the local LHS filter as a post-join filter, will 
incorrectly discard 11 qualified rows. 

select count(*) from cp.`tpch/nation.parquet` n left outer join 
cp.`tpch/region.parquet` r on n.n_regionkey = r.r_regionkey and n.n_nationkey > 
10;
EXPR$0
14
Total rows returned : 1 
 
Drill Physical :
00-00    Screen: rowcount = 1.0, cumulative cost = {128.85 rows, 608.35 cpu, 
0.0 io, 0.0 network}, id = 500
00-01      StreamAgg(group=[{}], EXPR$0=[COUNT()]): rowcount = 1.0, cumulative 
cost = {128.75 rows, 608.25 cpu, 0.0 io, 0.0 network}, id = 499
00-02        Project($f0=[0]): rowcount = 6.25, cumulative cost = {122.5 rows, 
533.25 cpu, 0.0 io, 0.0 network}, id = 498
00-03          SelectionVectorRemover: rowcount = 6.25, cumulative cost = 
{116.25 rows, 529.25 cpu, 0.0 io, 0.0 network}, id = 497
00-04            Filter(condition=[$1]): rowcount = 6.25, cumulative cost = 
{110.0 rows, 523.0 cpu, 0.0 io, 0.0 network}, id = 496
00-05              HashJoin(condition=[=($0, $2)], joinType=[left]): rowcount = 
25.0, cumulative cost = {85.0 rows, 423.0 cpu, 0.0 io, 0.0 network}, id = 495
00-07                Project(n_regionkey=[$1], $f3=[>($0, 10)]): rowcount = 
25.0, cumulative cost = {50.0 rows, 58.0 cpu, 0.0 io, 0.0 network}, id = 493
00-08                  Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], 
selectionRoot=/tpch/nation.parquet, columns=[SchemaPath [`n_regionkey`], 
SchemaPath [`n_nationkey`]]]]): rowcount = 25.0, cumulative cost = {25.0 rows, 
50.0 cpu, 0.0 io, 0.0 network}, id = 492
00-06                Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=/tpch/region.parquet]], 
selectionRoot=/tpch/region.parquet, columns=[SchemaPath [`r_regionkey`]]]]): 
rowcount = 5.0, cumulative cost = {5.0 rows, 5.0 cpu, 0.0 io, 0.0 network}, id 
= 494

Drill Physical :
00-00    Screen: rowcount = 1.0, cumulative cost = {128.85 rows, 608.35 cpu, 
0.0 io, 0.0 network}, id = 500
00-01      StreamAgg(group=[{}], EXPR$0=[COUNT()]): rowcount = 1.0, cumulative 
cost = {128.75 rows, 608.25 cpu, 0.0 io, 0.0 network}, id = 499
00-02        Project($f0=[0]): rowcount = 6.25, cumulative cost = {122.5 rows, 
533.25 cpu, 0.0 io, 0.0 network}, id = 498
00-03          SelectionVectorRemover: rowcount = 6.25, cumulative cost = 
{116.25 rows, 529.25 cpu, 0.0 io, 0.0 network}, id = 497
00-04            Filter(condition=[$1]): rowcount = 6.25, cumulative cost = 
{110.0 rows, 523.0 cpu, 0.0 io, 0.0 network}, id = 496
00-05              HashJoin(condition=[=($0, $2)], joinType=[left]): rowcount = 
25.0, cumulative cost = {85.0 rows, 423.0 cpu, 0.0 io, 0.0 network}, id = 495
00-07                Project(n_regionkey=[$1], $f3=[>($0, 10)]): rowcount = 
25.0, cumulative cost = {50.0 rows, 58.0 cpu, 0.0 io, 0.0 network}, id = 493
00-08                  Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], 
selectionRoot=/tpch/nation.parquet, columns=[SchemaPath [`n_regionkey`], 
SchemaPath [`n_nationkey`]]]]): rowcount = 25.0, cumulative cost = {25.0 rows, 
50.0 cpu, 0.0 io, 0.0 network}, id = 492
00-06                Scan(groupscan=[ParquetGroupScan 
[entries=[ReadEntryWithPath [path=/tpch/region.parquet]], 
selectionRoot=/tpch/region.parquet, columns=[SchemaPath [`r_regionkey`]]]]): 
rowcount = 5.0, cumulative cost = {5.0 rows, 5.0 cpu, 0.0 io, 0.0 network}, id 
= 494 

> TPCH Q13 may return incorrect rows :  Drill may incorrectly pull up a local 
> right filter in a left outer join condition.
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-1337
>                 URL: https://issues.apache.org/jira/browse/DRILL-1337
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> For TPCH Q13, Drill may discard some qualified rows, since Drill pulls up a 
> local RHS filter in a left outer join, and put it on top of Join. There means 
> the local RHS is treated as a post-join condition, which will incorrectly 
> discard some qualified rows.
> select
>   c_count,
>   count(*) as custdist
> from
>   (
>     select
>       c.c_custkey,
>       count(o.o_orderkey)
>     from
>       cp.`tpch/customer.parquet` c 
>       left outer join cp.`tpch/orders.parquet` o 
>         on c.c_custkey = o.o_custkey
>         and o.o_comment not like '%special%requests%'
>     group by
>       c.c_custkey
>   ) as orders (c_custkey, c_count)
> group by
>   c_count
> order by
>   custdist desc,
>   c_count desc;
> Drill Physical : 
> .......................
> 02-06                              Filter(condition=[$3]): rowcount = 3750.0, 
> cumulative cost = {79500.0 rows, 568512.0 cpu, 0.0 io, 1.90464E8 network, 
> 264000.0 memory}, id = 2649
> 02-07                                HashJoin(condition=[=($0, $1)], 
> joinType=[left]): rowcount = 15000.0, cumulative cost = {64500.0 rows, 
> 508512.0 cpu, 0.0 io, 1.90464E8 network, 264000.0 memory}, id = 2648
> 02-09                                  HashToRandomExchange(dist0=[[$0]]): 
> rowcount = 1500.0, cumulative cost = {3000.0 rows, 25500.0 cpu, 0.0 io, 
> 6144000.0 network, 0.0 memory}, id = 2644
> 03-01                                    Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=/tpch/customer.parquet]], 
> selectionRoot=/tpch/customer.parquet, columns=[SchemaPath [`c_custkey`]]]]): 
> rowcount = 1500.0, cumulative cost = {1500.0 rows, 1500.0 cpu, 0.0 io, 0.0 
> network, 0.0 memory}, id = 2643
> 02-08                                  HashToRandomExchange(dist0=[[$0]]): 
> rowcount = 15000.0, cumulative cost = {45000.0 rows, 285012.0 cpu, 0.0 io, 
> 1.8432E8 network, 0.0 memory}, id = 2647
> 04-01                                    Project(o_custkey=[$1], 
> o_orderkey=[$0], $f4=[NOT(LIKE($2, '%special%requests%'))]): rowcount = 
> 15000.0, cumulative cost = {30000.0 rows, 45012.0 cpu, 0.0 io, 0.0 network, 
> 0.0 memory}, id = 2646
> 04-02                                      Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=/tpch/orders.parquet]], 
> selectionRoot=/tpch/orders.parquet, columns=[SchemaPath [`o_custkey`], 
> SchemaPath [`o_orderkey`], SchemaPath [`o_comment`]]]]): rowcount = 15000.0, 
> cumulative cost = {15000.0 rows, 45000.0 cpu, 0.0 io, 0.0 network, 0.0 
> memory}, id = 2645



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to