[ 
https://issues.apache.org/jira/browse/DRILL-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boaz Ben-Zvi updated DRILL-5935:
--------------------------------
    Attachment: 26045873-f26c-8417-3f13-4e74af1f2317.sys.drill

> Hash Join projects unneeded columns
> -----------------------------------
>
>                 Key: DRILL-5935
>                 URL: https://issues.apache.org/jira/browse/DRILL-5935
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.11.0
>            Reporter: Boaz Ben-Zvi
>            Priority: Minor
>              Labels: Hash-Join
>         Attachments: 26045873-f26c-8417-3f13-4e74af1f2317.sys.drill
>
>
>  The Hash Join operator projects all of its input columns, including unneeded 
> ones, and relies on (multiple) Project operators downstream to remove them. 
> This wastes both time and space, because each value is copied individually. 
> Instead, the Hash Join itself should skip the unneeded columns. 
>    In the following example, the join-key columns need not be projected, 
> yet both hash join operators project them. Another problem: the join-key 
> columns are copied from BOTH sides (build and probe), which is wasteful 
> because the two copies are IDENTICAL.
>    Last, the plan in this example places the first join on the _build_ 
> side of the second join, so the unneeded column from the first join (its 
> join-key) is carried into the second join and projected again there. 
> The sample query is:
> {code}
> select c.c_first_name, c.c_last_name, s.ss_quantity, a.ca_city
> from dfs.`/data/json/s1/customer` c,
>      dfs.`/data/json/s1/store_sales` s,
>      dfs.`/data/json/s1/customer_address` a
> where c.c_customer_sk = s.ss_customer_sk
>   and c.c_customer_id = a.ca_address_id;
> {code}
> The plan first builds on 'customer_address' and probes with 'customer', and 
> its output projects all 6 columns (2 from 'a', 4 from 'c'). The second join 
> then builds on all 6 columns from the first join, probes with the large 
> table 'store_sales', and finally projects all 8 columns (see the generated 
> code below). Three Project operators are then needed to remove the unneeded 
> columns (see the attached profile), which adds further overhead.
> {code}
>     public void projectBuildRecord(int buildIndex, int outIndex)
>         throws SchemaChangeException
>     {
>         {
>             vv3 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv0 [((buildIndex)>>> 16)]);
>         }
>         {
>             vv9 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv6 [((buildIndex)>>> 16)]);
>         }
>         {
>             vv15 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv12 [((buildIndex)>>> 16)]);
>         }
>         {
>             vv21 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv18 [((buildIndex)>>> 16)]);
>         }
>         {
>             vv27 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv24 [((buildIndex)>>> 16)]);
>         }
>         {
>             vv33 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv30 [((buildIndex)>>> 16)]);
>         }
>     }
>     public void projectProbeRecord(int probeIndex, int outIndex)
>         throws SchemaChangeException
>     {
>         {
>             vv39 .copyFromSafe((probeIndex), (outIndex), vv36);
>         }
>         {
>             vv45 .copyFromSafe((probeIndex), (outIndex), vv42);
>         }
>     }
> {code}
>  
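
For illustration only, here is a minimal sketch of what a pruned build-side projection could look like. It is not Drill code: the class name, the `needed` mask, and the plain `String` arrays standing in for value vectors are all assumptions made for this example. The only detail taken from the generated code in the issue is the index layout, where the upper 16 bits of `buildIndex` select the batch and the lower 16 bits select the record within that batch.

```java
final class PrunedProjectionSketch {

    // Layout read off the generated code in the issue: the upper 16 bits of
    // the build index pick the batch, the lower 16 bits pick the record in it.
    static int batchOf(int buildIndex) { return buildIndex >>> 16; }
    static int recordOf(int buildIndex) { return buildIndex & 65535; }
    static int compositeIndex(int batch, int record) { return (batch << 16) | (record & 65535); }

    /**
     * Copies one build-side record into the output, skipping columns that
     * nothing downstream reads. columns[c][batch][record] is a hypothetical
     * stand-in for Drill's value vectors; out has one row per needed column.
     * Returns the number of values copied.
     */
    static int projectBuildRecord(String[][][] columns, boolean[] needed,
                                  int buildIndex, int outIndex, String[][] out) {
        int outColumn = 0;
        for (int c = 0; c < columns.length; c++) {
            if (!needed[c]) {
                continue; // e.g. a join key that is not in the select list
            }
            out[outColumn][outIndex] = columns[c][batchOf(buildIndex)][recordOf(buildIndex)];
            outColumn++;
        }
        return outColumn;
    }

    public static void main(String[] args) {
        // Two build-side columns from 'customer_address', two batches of two records each.
        // The values are made up for the example.
        String[][][] columns = {
            { {"id0", "id1"}, {"id2", "id3"} },             // ca_address_id (join key)
            { {"Austin", "Boston"}, {"Chicago", "Denver"} } // ca_city
        };
        boolean[] needed = { false, true }; // only ca_city is selected by the query
        String[][] out = new String[1][1];

        int copied = projectBuildRecord(columns, needed, compositeIndex(1, 0), 0, out);
        System.out.println(copied + " value copied: " + out[0][0]);
    }
}
```

With the mask above, the first join would copy one build-side value per output row instead of two. In a real fix the mask would presumably be resolved once at code-generation time, so that no `copyFromSafe` call is emitted at all for the skipped columns, rather than being tested per record as in this sketch.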



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
