[
https://issues.apache.org/jira/browse/DRILL-4201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147610#comment-15147610
]
Victoria Markman commented on DRILL-4201:
-----------------------------------------
I verified that partial filter is getting pushed down, however it is not going
to happen always. It depends on the costing and heuristic there is a bit tricky.
In the case below, filter is not going to be pushed pass project, because file
vicky.json contains only 2 rows:
{code}
0: jdbc:drill:schema=dfs> explain plan for select
. . . . . . . . . . . . > *
. . . . . . . . . . . . > from
. . . . . . . . . . . . > hive.lineitem_text_hive l
. . . . . . . . . . . . > inner join
. . . . . . . . . . . . > ( select
. . . . . . . . . . . . > flatten(test) as test,
. . . . . . . . . . . . > o_orderkey as orderkey
. . . . . . . . . . . . > from
. . . . . . . . . . . . >
dfs.`/drill/testdata/Tpch0.01/json/orders/vicky.json`) as o
. . . . . . . . . . . . > on ( l.l_orderkey = o.orderkey )
. . . . . . . . . . . . > where test = 1 and o.orderkey = 22;
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 Project(l_orderkey=[$0], l_partkey=[$1], l_suppkey=[$2],
l_linenumber=[$3], l_quantity=[$4], l_extendedprice=[$5], l_discount=[$6],
l_tax=[$7], l_returnflag=[$8], l_linestatus=[$9], l_shipdate=[$10],
l_commitdate=[$11], l_receiptdate=[$12], l_shipinstruct=[$13],
l_shipmode=[$14], l_comment=[$15], test=[$16], orderkey=[$17])
00-02 Project(l_orderkey=[$0], l_partkey=[$1], l_suppkey=[$2],
l_linenumber=[$3], l_quantity=[$4], l_extendedprice=[$5], l_discount=[$6],
l_tax=[$7], l_returnflag=[$8], l_linestatus=[$9], l_shipdate=[$10],
l_commitdate=[$11], l_receiptdate=[$12], l_shipinstruct=[$13],
l_shipmode=[$14], l_comment=[$15], test=[$16], orderkey=[$17])
00-03 HashJoin(condition=[=($0, $17)], joinType=[inner])
00-05 Scan(groupscan=[HiveScan [table=Table(dbName:default,
tableName:lineitem_text_hive), columns=[`*`], numPartitions=0, partitions=
null,
inputDirectories=[maprfs:/drill/testdata/partition_pruning/hive/text/lineitem]]])
00-04 SelectionVectorRemover
00-06 Filter(condition=[AND(=($0, 1), =($1, 22))])
00-07 Flatten(flattenField=[$0])
00-08 Project(test=[$1], orderkey=[$0])
00-09 Scan(groupscan=[EasyGroupScan
[selectionRoot=maprfs:/drill/testdata/Tpch0.01/json/orders/vicky.json,
numFiles=1, columns=[`test`, `o_orderkey`],
files=[maprfs:///drill/testdata/Tpch0.01/json/orders/vicky.json]]])
{code}
It's not going to be pushed pass project even if I add 40 columns to be
projected (json file with 2 rows):
{code}
0: jdbc:drill:schema=dfs> explain plan for select
. . . . . . . . . . . . > *
. . . . . . . . . . . . > from
. . . . . . . . . . . . > hive.lineitem_text_hive l
. . . . . . . . . . . . > inner join
. . . . . . . . . . . . > ( select
. . . . . . . . . . . . > flatten(test) as test,
. . . . . . . . . . . . > o_orderkey as orderkey,
. . . . . . . . . . . . > o_orderkey + 1 as o1,
. . . . . . . . . . . . > o_orderkey + 2 as o2,
. . . . . . . . . . . . > o_orderkey + 3 as o3,
. . . . . . . . . . . . > o_orderkey + 4 as o4,
. . . . . . . . . . . . > o_orderkey + 5 as o5,
. . . . . . . . . . . . > o_orderkey + 6 as o6,
. . . . . . . . . . . . > o_orderkey + 7 as o7,
. . . . . . . . . . . . > o_orderkey + 8 as o8,
. . . . . . . . . . . . > o_orderkey + 9 as o9,
. . . . . . . . . . . . > o_orderkey + 10 as o10,
. . . . . . . . . . . . > o_orderkey + 11 as o11,
. . . . . . . . . . . . > o_orderkey + 12 as o12,
. . . . . . . . . . . . > o_orderkey + 13 as o13,
. . . . . . . . . . . . > o_orderkey + 14 as o14,
. . . . . . . . . . . . > o_orderkey + 15 as o15,
. . . . . . . . . . . . > o_orderkey + 16 as o16,
. . . . . . . . . . . . > o_orderkey + 17 as o17,
. . . . . . . . . . . . > o_orderkey + 18 as o18,
. . . . . . . . . . . . > o_orderkey + 19 as o19,
. . . . . . . . . . . . > o_orderkey + 20 as o20,
. . . . . . . . . . . . > o_orderkey + 21 as o21,
. . . . . . . . . . . . > o_orderkey + 22 as o22,
. . . . . . . . . . . . > o_orderkey + 23 as o23,
. . . . . . . . . . . . > o_orderkey + 24 as o24,
. . . . . . . . . . . . > o_orderkey + 25 as o25,
. . . . . . . . . . . . > o_orderkey + 26 as o26,
. . . . . . . . . . . . > o_orderkey + 27 as o27,
. . . . . . . . . . . . > o_orderkey + 28 as o28,
. . . . . . . . . . . . > o_orderkey + 29 as o29,
. . . . . . . . . . . . > o_orderkey + 30 as o30,
. . . . . . . . . . . . > o_orderkey + 31 as o31,
. . . . . . . . . . . . > o_orderkey + 32 as o32,
. . . . . . . . . . . . > o_orderkey + 33 as o33,
. . . . . . . . . . . . > o_orderkey + 34 as o34,
. . . . . . . . . . . . > o_orderkey + 35 as o35,
. . . . . . . . . . . . > o_orderkey + 36 as o36,
. . . . . . . . . . . . > o_orderkey + 37 as o37,
. . . . . . . . . . . . > o_orderkey + 38 as o38,
. . . . . . . . . . . . > o_orderkey + 39 as o39,
. . . . . . . . . . . . > o_orderkey + 40 as o40
. . . . . . . . . . . . > from
. . . . . . . . . . . . >
dfs.`/drill/testdata/Tpch0.01/json/orders/vicky.json`) as o
. . . . . . . . . . . . > on ( l.l_orderkey = o.orderkey)
. . . . . . . . . . . . > where test = 1 and o.orderkey = 22;
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 Project(l_orderkey=[$0], l_partkey=[$1], l_suppkey=[$2],
l_linenumber=[$3], l_quantity=[$4], l_extendedprice=[$5], l_discount=[$6],
l_tax=[$7], l_returnflag=[$8], l_linestatus=[$9], l_shipdate=[$10],
l_commitdate=[$11], l_receiptdate=[$12], l_shipinstruct=[$13],
l_shipmode=[$14], l_comment=[$15], test=[$16], orderkey=[$17], o1=[$18],
o2=[$19], o3=[$20], o4=[$21], o5=[$22], o6=[$23], o7=[$24], o8=[$25], o9=[$26],
o10=[$27], o11=[$28], o12=[$29], o13=[$30], o14=[$31], o15=[$32], o16=[$33],
o17=[$34], o18=[$35], o19=[$36], o20=[$37], o21=[$38], o22=[$39], o23=[$40],
o24=[$41], o25=[$42], o26=[$43], o27=[$44], o28=[$45], o29=[$46], o30=[$47],
o31=[$48], o32=[$49], o33=[$50], o34=[$51], o35=[$52], o36=[$53], o37=[$54],
o38=[$55], o39=[$56], o40=[$57])
00-02 Project(l_orderkey=[$0], l_partkey=[$1], l_suppkey=[$2],
l_linenumber=[$3], l_quantity=[$4], l_extendedprice=[$5], l_discount=[$6],
l_tax=[$7], l_returnflag=[$8], l_linestatus=[$9], l_shipdate=[$10],
l_commitdate=[$11], l_receiptdate=[$12], l_shipinstruct=[$13],
l_shipmode=[$14], l_comment=[$15], test=[$16], orderkey=[$17], o1=[$18],
o2=[$19], o3=[$20], o4=[$21], o5=[$22], o6=[$23], o7=[$24], o8=[$25], o9=[$26],
o10=[$27], o11=[$28], o12=[$29], o13=[$30], o14=[$31], o15=[$32], o16=[$33],
o17=[$34], o18=[$35], o19=[$36], o20=[$37], o21=[$38], o22=[$39], o23=[$40],
o24=[$41], o25=[$42], o26=[$43], o27=[$44], o28=[$45], o29=[$46], o30=[$47],
o31=[$48], o32=[$49], o33=[$50], o34=[$51], o35=[$52], o36=[$53], o37=[$54],
o38=[$55], o39=[$56], o40=[$57])
00-03 HashJoin(condition=[=($0, $17)], joinType=[inner])
00-05 Scan(groupscan=[HiveScan [table=Table(dbName:default,
tableName:lineitem_text_hive), columns=[`*`], numPartitions=0, partitions=
null,
inputDirectories=[maprfs:/drill/testdata/partition_pruning/hive/text/lineitem]]])
00-04 SelectionVectorRemover
00-06 Filter(condition=[AND(=($0, 1), =($1, 22))])
00-07 Flatten(flattenField=[$0])
00-08 Project(test=[$1], orderkey=[$0], o1=[+($0, 1)],
o2=[+($0, 2)], o3=[+($0, 3)], o4=[+($0, 4)], o5=[+($0, 5)], o6=[+($0, 6)],
o7=[+($0, 7)], o8=[+($0, 8)], o9=[+($0, 9)], o10=[+($0, 10)], o11=[+($0, 11)],
o12=[+($0, 12)], o13=[+($0, 13)], o14=[+($0, 14)], o15=[+($0, 15)], o16=[+($0,
16)], o17=[+($0, 17)], o18=[+($0, 18)], o19=[+($0, 19)], o20=[+($0, 20)],
o21=[+($0, 21)], o22=[+($0, 22)], o23=[+($0, 23)], o24=[+($0, 24)], o25=[+($0,
25)], o26=[+($0, 26)], o27=[+($0, 27)], o28=[+($0, 28)], o29=[+($0, 29)],
o30=[+($0, 30)], o31=[+($0, 31)], o32=[+($0, 32)], o33=[+($0, 33)], o34=[+($0,
34)], o35=[+($0, 35)], o36=[+($0, 36)], o37=[+($0, 37)], o38=[+($0, 38)],
o39=[+($0, 39)], o40=[+($0, 40)])
00-09 Scan(groupscan=[EasyGroupScan
[selectionRoot=maprfs:/drill/testdata/Tpch0.01/json/orders/vicky.json,
numFiles=1, columns=[`test`, `o_orderkey`],
files=[maprfs:///drill/testdata/Tpch0.01/json/orders/vicky.json]]])
{code}
Filter pushdown is triggered when size of the table increased to 300+ rows:
{code}
0: jdbc:drill:schema=dfs> explain plan for select
. . . . . . . . . . . . > *
. . . . . . . . . . . . > from
. . . . . . . . . . . . > hive.lineitem_text_hive l
. . . . . . . . . . . . > inner join
. . . . . . . . . . . . > ( select
. . . . . . . . . . . . > flatten(test) as test,
. . . . . . . . . . . . > o_orderkey as orderkey
. . . . . . . . . . . . > from
. . . . . . . . . . . . >
dfs.`/drill/testdata/Tpch0.01/json/orders/orders_1.json`) as o
. . . . . . . . . . . . > on ( l.l_orderkey = o.orderkey )
. . . . . . . . . . . . > where test = 1 and o.orderkey = 22;
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 Project(l_orderkey=[$0], l_partkey=[$1], l_suppkey=[$2],
l_linenumber=[$3], l_quantity=[$4], l_extendedprice=[$5], l_discount=[$6],
l_tax=[$7], l_returnflag=[$8], l_linestatus=[$9], l_shipdate=[$10],
l_commitdate=[$11], l_receiptdate=[$12], l_shipinstruct=[$13],
l_shipmode=[$14], l_comment=[$15], test=[$16], orderkey=[$17])
00-02 Project(l_orderkey=[$0], l_partkey=[$1], l_suppkey=[$2],
l_linenumber=[$3], l_quantity=[$4], l_extendedprice=[$5], l_discount=[$6],
l_tax=[$7], l_returnflag=[$8], l_linestatus=[$9], l_shipdate=[$10],
l_commitdate=[$11], l_receiptdate=[$12], l_shipinstruct=[$13],
l_shipmode=[$14], l_comment=[$15], test=[$16], orderkey=[$17])
00-03 HashJoin(condition=[=($0, $17)], joinType=[inner])
00-05 Scan(groupscan=[HiveScan [table=Table(dbName:default,
tableName:lineitem_text_hive), columns=[`*`], numPartitions=0, partitions=
null,
inputDirectories=[maprfs:/drill/testdata/partition_pruning/hive/text/lineitem]]])
00-04 SelectionVectorRemover
00-06 Filter(condition=[=($0, 1)])
00-07 Flatten(flattenField=[$0])
00-08 Project(test=[$0], orderkey=[$1])
00-09 SelectionVectorRemover
00-10 Filter(condition=[=($1, 22)])
00-11 Project(test=[$1], o_orderkey=[$0])
00-12 Scan(groupscan=[EasyGroupScan
[selectionRoot=maprfs:/drill/testdata/Tpch0.01/json/orders/orders_1.json,
numFiles=1, columns=[`test`, `o_orderkey`],
files=[maprfs:///drill/testdata/Tpch0.01/json/orders/orders_1.json]]])
{code}
I will add this test case to the test suite. It is different from Jinfeng's
unit tests, he is testing with directory labels, which is a bit different code
pass.
In my opinion, we should revisit test coverage in this area after we switch to
HEP planner (DRILL-3996). Will file a separate enhancement for that.
Consider this fixed and verified in:
{code}
#Generated by Git-Commit-Id-Plugin
#Sat Feb 13 00:36:18 UTC 2016
git.commit.id.abbrev=0a2518d
[email protected]
git.commit.message.full=DRILL-4363\: Row count based pruning for parquet table
used in Limit n query.\n\nModify two existint unit testcase\:\n1)
TestPartitionFilter.testMainQueryFalseCondition()\: rowCount pruning applied
after false condition is transformed into LIMIT 0\n2)
TestLimitWithExchanges.testPushLimitPastUnionExchange()\: modify the testcase
to use Json source, so that it does not mix with PushLimitIntoScanRule.\n
git.commit.id=0a2518d7cf01a92a27a82e29edac5424bedf31d5
git.commit.message.short=DRILL-4363\: Row count based pruning for parquet table
used in Limit n query.
git.commit.user.name=Jinfeng Ni
git.build.user.name=vmarkman
git.commit.id.describe=0.9.0-602-g0a2518d
[email protected]
git.branch=0a2518d7cf01a92a27a82e29edac5424bedf31d5
git.commit.time=11.02.2016 @ 23\:01\:15 UTC
git.build.time=13.02.2016 @ 00\:36\:18 UTC
[email protected]\:apache/drill.git
{code}
> DrillPushFilterPastProject should allow partial filter pushdown.
> -----------------------------------------------------------------
>
> Key: DRILL-4201
> URL: https://issues.apache.org/jira/browse/DRILL-4201
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Reporter: Jinfeng Ni
> Assignee: Jinfeng Ni
> Fix For: 1.5.0
>
>
> Currently, DrillPushFilterPastProjectRule will stop pushing the filter down,
> if the filter itself has ITEM or FLATTEN function, or its input reference is
> referring to an ITEM or FLATTEN function. However, in case that the filter is
> a conjunction of multiple sub-filters, some of them refer to ITEM or FLATTEN
> but the other not, then we should allow partial filter to be pushed down. For
> instance,
> WHERE partition_col > 10 and flatten_output_col = 'ABC'.
> The "flatten_output_col" comes from the output of FLATTEN operator, and
> therefore flatten_output_col = 'ABC' should not pushed past the project. But
> partiion_col > 10 should be pushed down, such that we could trigger the
> pruning rule to apply partition pruning.
> It would be improve Drill query performance, when the partially pushed filter
> leads to partition pruning, or the partially pushed filter results in early
> filtering in upstream operator.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)