[
https://issues.apache.org/jira/browse/DRILL-7013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Boaz Ben-Zvi updated DRILL-7013:
--------------------------------
Description:
The Hash-Join and Hash-Aggr operators copy each incoming row separately. When
the incoming data has a selection vector (e.g., outgoing from a Filter), a
_SelectionVectorRemover_ is added before the Hash operator, as the latter
cannot handle the selection vector.
Thus every row is needlessly being copied twice!
+Suggestion+: Enhance the Hash operators to handle potential incoming selection
vectors, thus eliminating the need for the extra copy. The planner needs to be
changed not to add that SelectionVectorRemover.
(Two comments:
* Note the special case of Hash-Join with num_partitions = 1, where the build
side vectors are used as is, not copied.
* Conflicts with the suggestion not to copy probe vectors, in DRILL-5912 )
For example:
{code:sql}
select * from cp.`tpch/lineitem.parquet` L, cp.`tpch/orders.parquet` O where
O.o_custkey > 1498 and L.l_orderkey > 58999 and O.o_orderkey = L.l_orderkey
{code}
And the plan:
{panel}
00-00 Screen : rowType = RecordType(DYNAMIC_STAR **, DYNAMIC_STAR **0):
00-01 ProjectAllowDup(=[$0], 0=[$1]) : rowType = RecordType(DYNAMIC_STAR ,
DYNAMIC_STAR 0):
00-02 Project(T44¦¦=[$0], T45¦¦=[$2]) : rowType = RecordType(DYNAMIC_STAR
T44¦¦, DYNAMIC_STAR T45¦¦):
00-03 HashJoin(condition=[=($1, $4)], joinType=[inner], semi-join: =[false]) :
rowType = RecordType(DYNAMIC_STAR T44¦¦, ANY l_orderkey, DYNAMIC_STAR T45¦¦,
ANY o_custkey, ANY o_orderkey):
00-05 *SelectionVectorRemover* : rowType = RecordType(DYNAMIC_STAR T44¦¦,
ANY l_orderkey):
00-07 Filter(condition=[>($1, 58999)]) : rowType = RecordType(DYNAMIC_STAR
T44¦¦, ANY l_orderkey):
00-09 Project(T44¦¦=[$0], l_orderkey=[$1]) : rowType =
RecordType(DYNAMIC_STAR T44¦¦, ANY l_orderkey):
00-11 Scan(table=[[cp, tpch/lineitem.parquet]],
groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
[path=classpath:/tpch/lineitem.parquet]],
00-04 *SelectionVectorRemover* : rowType = RecordType(DYNAMIC_STAR T45¦¦,
ANY o_custkey, ANY o_orderkey):
00-06 Filter(condition=[AND(>($1, 1498), >($2, 58999))]) : rowType =
RecordType(DYNAMIC_STAR T45¦¦, ANY o_custkey, ANY o_orderkey):
00-08 Project(T45¦¦**=[$0], o_custkey=[$1], o_orderkey=[$2]) : rowType =
RecordType(DYNAMIC_STAR T45¦¦, ANY o_custkey, ANY o_orderkey):
00-10 Scan(table=[[cp, tpch/orders.parquet]],
{panel}
was:
The Hash-Join and Hash-Aggr operators copy each incoming row separately. When
the incoming data has a selection vector (e.g., outgoing from a Filter), a
_SelectionVectorRemover_ is added before the Hash operator, as the latter
cannot handle the selection vector.
Thus every row is needlessly being copied twice!
+Suggestion+: Enhance the Hash operators to handle potential incoming selection
vectors, thus eliminating the need for the extra copy. The planner needs to be
changed not to add that SelectionVectorRemover.
For example:
{code:sql}
select * from cp.`tpch/lineitem.parquet` L, cp.`tpch/orders.parquet` O where
O.o_custkey > 1498 and L.l_orderkey > 58999 and O.o_orderkey = L.l_orderkey
{code}
And the plan:
{panel}
00-00 Screen : rowType = RecordType(DYNAMIC_STAR **, DYNAMIC_STAR **0):
00-01 ProjectAllowDup(**=[$0], **0=[$1]) : rowType = RecordType(DYNAMIC_STAR
**, DYNAMIC_STAR **0):
00-02 Project(T44¦¦**=[$0], T45¦¦**=[$2]) : rowType = RecordType(DYNAMIC_STAR
T44¦¦**, DYNAMIC_STAR T45¦¦**):
00-03 HashJoin(condition=[=($1, $4)], joinType=[inner], semi-join: =[false]) :
rowType = RecordType(DYNAMIC_STAR T44¦¦**, ANY l_orderkey, DYNAMIC_STAR
T45¦¦**, ANY o_custkey, ANY o_orderkey):
00-05 *SelectionVectorRemover* : rowType = RecordType(DYNAMIC_STAR T44¦¦**,
ANY l_orderkey):
00-07 Filter(condition=[>($1, 58999)]) : rowType = RecordType(DYNAMIC_STAR
T44¦¦**, ANY l_orderkey):
00-09 Project(T44¦¦**=[$0], l_orderkey=[$1]) : rowType =
RecordType(DYNAMIC_STAR T44¦¦**, ANY l_orderkey):
00-11 Scan(table=[[cp, tpch/lineitem.parquet]], groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath [path=classpath:/tpch/lineitem.parquet]],
00-04 *SelectionVectorRemover* : rowType = RecordType(DYNAMIC_STAR T45¦¦**,
ANY o_custkey, ANY o_orderkey):
00-06 Filter(condition=[AND(>($1, 1498), >($2, 58999))]) : rowType =
RecordType(DYNAMIC_STAR T45¦¦**, ANY o_custkey, ANY o_orderkey):
00-08 Project(T45¦¦**=[$0], o_custkey=[$1], o_orderkey=[$2]) : rowType =
RecordType(DYNAMIC_STAR T45¦¦**, ANY o_custkey, ANY o_orderkey):
00-10 Scan(table=[[cp, tpch/orders.parquet]],
{panel}
> Hash-Join and Hash-Aggr to handle incoming with selection vectors
> -----------------------------------------------------------------
>
> Key: DRILL-7013
> URL: https://issues.apache.org/jira/browse/DRILL-7013
> Project: Apache Drill
> Issue Type: Improvement
> Components: Execution - Relational Operators, Query Planning &
> Optimization
> Affects Versions: 1.15.0
> Reporter: Boaz Ben-Zvi
> Priority: Minor
>
> The Hash-Join and Hash-Aggr operators copy each incoming row separately.
> When the incoming data has a selection vector (e.g., outgoing from a Filter),
> a _SelectionVectorRemover_ is added before the Hash operator, as the latter
> cannot handle the selection vector.
> Thus every row is needlessly being copied twice!
> +Suggestion+: Enhance the Hash operators to handle potential incoming
> selection vectors, thus eliminating the need for the extra copy. The planner
> needs to be changed not to add that SelectionVectorRemover.
> (Two comments:
> * Note the special case of Hash-Join with num_partitions = 1, where the build
> side vectors are used as is, not copied.
> * Conflicts with the suggestion not to copy probe vectors, in DRILL-5912 )
>
> For example:
> {code:sql}
> select * from cp.`tpch/lineitem.parquet` L, cp.`tpch/orders.parquet` O where
> O.o_custkey > 1498 and L.l_orderkey > 58999 and O.o_orderkey = L.l_orderkey
> {code}
> And the plan:
> {panel}
> 00-00 Screen : rowType = RecordType(DYNAMIC_STAR **, DYNAMIC_STAR **0):
> 00-01 ProjectAllowDup(=[$0], 0=[$1]) : rowType = RecordType(DYNAMIC_STAR ,
> DYNAMIC_STAR 0):
> 00-02 Project(T44¦¦=[$0], T45¦¦=[$2]) : rowType = RecordType(DYNAMIC_STAR
> T44¦¦, DYNAMIC_STAR T45¦¦):
> 00-03 HashJoin(condition=[=($1, $4)], joinType=[inner], semi-join: =[false])
> : rowType = RecordType(DYNAMIC_STAR T44¦¦, ANY l_orderkey, DYNAMIC_STAR
> T45¦¦, ANY o_custkey, ANY o_orderkey):
> 00-05 *SelectionVectorRemover* : rowType = RecordType(DYNAMIC_STAR T44¦¦,
> ANY l_orderkey):
> 00-07 Filter(condition=[>($1, 58999)]) : rowType =
> RecordType(DYNAMIC_STAR T44¦¦, ANY l_orderkey):
> 00-09 Project(T44¦¦=[$0], l_orderkey=[$1]) : rowType =
> RecordType(DYNAMIC_STAR T44¦¦, ANY l_orderkey):
> 00-11 Scan(table=[[cp, tpch/lineitem.parquet]],
> groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath
> [path=classpath:/tpch/lineitem.parquet]],
> 00-04 *SelectionVectorRemover* : rowType = RecordType(DYNAMIC_STAR T45¦¦,
> ANY o_custkey, ANY o_orderkey):
> 00-06 Filter(condition=[AND(>($1, 1498), >($2, 58999))]) : rowType =
> RecordType(DYNAMIC_STAR T45¦¦, ANY o_custkey, ANY o_orderkey):
> 00-08 Project(T45¦¦**=[$0], o_custkey=[$1], o_orderkey=[$2]) : rowType
> = RecordType(DYNAMIC_STAR T45¦¦, ANY o_custkey, ANY o_orderkey):
> 00-10 Scan(table=[[cp, tpch/orders.parquet]],
> {panel}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)