[jira] [Commented] (DRILL-6758) Hash Join should not return the join columns when they are not needed downstream

Boaz Ben-Zvi (JIRA) Fri, 21 Sep 2018 19:25:27 -0700


    [ 
https://issues.apache.org/jira/browse/DRILL-6758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624438#comment-16624438
 ]


Boaz Ben-Zvi commented on DRILL-6758:
-------------------------------------

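Here is an example: the HashJoin (00-03) returns the join columns *o_custkey* and *c_custkey*, although the Project above it (00-02) keeps only o_orderstatus and o_orderkey: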

{code}
0: jdbc:drill:zk=local> explain plan including all attributes for select ord.o_orderstatus, ord.o_orderkey from cp.`tpch/orders.parquet` ord left join cp.`tpch/customer.parquet` cust on ord.o_custkey = cust.c_custkey;
+------+------+
| text | json |
+------+------+
| 00-00    Screen : rowType = RecordType(ANY o_orderstatus, ANY o_orderkey): rowcount = 15000.0, cumulative cost = {64500.0 rows, 300000.0 cpu, 0.0 io, 0.0 network, 26400.000000000004 memory}, id = 11045
00-01      Project(o_orderstatus=[$0], o_orderkey=[$1]) : rowType = RecordType(ANY o_orderstatus, ANY o_orderkey): rowcount = 15000.0, cumulative cost = {63000.0 rows, 298500.0 cpu, 0.0 io, 0.0 network, 26400.000000000004 memory}, id = 11044
00-02        Project(o_orderstatus=[$1], o_orderkey=[$2]) : rowType = RecordType(ANY o_orderstatus, ANY o_orderkey): rowcount = 15000.0, cumulative cost = {48000.0 rows, 268500.0 cpu, 0.0 io, 0.0 network, 26400.000000000004 memory}, id = 11043
00-03          HashJoin(condition=[=($0, $3)], joinType=[left]) : rowType = RecordType(ANY o_custkey, ANY o_orderstatus, ANY o_orderkey, ANY c_custkey): rowcount = 15000.0, cumulative cost = {33000.0 rows, 238500.0 cpu, 0.0 io, 0.0 network, 26400.000000000004 memory}, id = 11042
00-05            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/orders.parquet]], selectionRoot=classpath:/tpch/orders.parquet, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`o_custkey`, `o_orderstatus`, `o_orderkey`]]]) : rowType = RecordType(ANY o_custkey, ANY o_orderstatus, ANY o_orderkey): rowcount = 15000.0, cumulative cost = {15000.0 rows, 45000.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 11040
00-04            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=classpath:/tpch/customer.parquet]], selectionRoot=classpath:/tpch/customer.parquet, numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`c_custkey`]]]) : rowType = RecordType(ANY c_custkey): rowcount = 1500.0, cumulative cost = {1500.0 rows, 1500.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 11041
{code}
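
In the plan above, the HashJoin at 00-03 has a four-column row type, while the Project directly above it (00-02) references only $1 and $2. The following is a minimal sketch of how that gap could be computed. It is not Drill code: the class UnusedJoinColumns and the method unusedColumns are hypothetical names chosen for illustration, and the inputs are copied by hand from the plan text.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Drill: lists the join output columns
// that the parent operator never references.
final class UnusedJoinColumns {

    // joinOutput: column names in the join's row type, in ordinal order.
    // referencedByParent: the $N ordinals the parent operator reads.
    static List<String> unusedColumns(List<String> joinOutput, List<Integer> referencedByParent) {
        Set<Integer> used = new HashSet<>(referencedByParent);
        List<String> unused = new ArrayList<>();
        for (int i = 0; i < joinOutput.size(); i++) {
            if (!used.contains(i)) {
                unused.add(joinOutput.get(i));
            }
        }
        return unused;
    }

    public static void main(String[] args) {
        // Row type of HashJoin 00-03 in the plan above.
        List<String> joinOutput =
            List.of("o_custkey", "o_orderstatus", "o_orderkey", "c_custkey");
        // Project 00-02 reads only $1 and $2.
        List<Integer> referenced = List.of(1, 2);
        // Prints [o_custkey, c_custkey]
        System.out.println(unusedColumns(joinOutput, referenced));
    }
}
```

For this plan the result is the two join keys, which are needed to evaluate the join condition but are not needed in the join's output.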


> Hash Join should not return the join columns when they are not needed 
> downstream
> --------------------------------------------------------------------------------
>
>                 Key: DRILL-6758
>                 URL: https://issues.apache.org/jira/browse/DRILL-6758
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators, Query Planning & Optimization
>    Affects Versions: 1.14.0
>            Reporter: Boaz Ben-Zvi
>            Assignee: Hanumath Rao Maduri
>            Priority: Minor
>             Fix For: 1.15.0
>
>
> Currently the Hash-Join operator returns all of its incoming columns, from 
> both sides. When the join columns are not used further downstream, this is 
> wasted work (allocating vectors, copying each value, etc.).
>   Suggestion: Have the planner pass this information to the Hash-Join 
> operator, so that it can skip returning these columns.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
