[
https://issues.apache.org/jira/browse/DRILL-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Boaz Ben-Zvi updated DRILL-5935:
--------------------------------
Attachment: 26045873-f26c-8417-3f13-4e74af1f2317.sys.drill
> Hash Join projects unneeded columns
> -----------------------------------
>
> Key: DRILL-5935
> URL: https://issues.apache.org/jira/browse/DRILL-5935
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Relational Operators
> Affects Versions: 1.11.0
> Reporter: Boaz Ben-Zvi
> Priority: Minor
> Labels: Hash-Join
> Attachments: 26045873-f26c-8417-3f13-4e74af1f2317.sys.drill
>
>
> The Hash Join operator projects all its input columns, including unneeded
> ones, relying on (multiple) project operators downstream to remove those
> columns. This is significantly wasteful, in both time and space (as each
> value is copied individually). Instead, the Hash Join itself should not
> project these unneeded columns.
> In the following example, the join-key columns need not be projected.
> However the two hash join operators do project them. Another problem: The
> join-key columns are copied from BOTH sides (build and probe), which is a
> waste, as both are IDENTICAL.
> Last - the plan in this example places the first join under the _build_
> side of the second join; and the unneeded column from the first join (the
> join-key) is taken and finally projected by the second join.
> The sample query is:
> {code}
> select c.c_first_name, c.c_last_name, s.ss_quantity, a.ca_city
> from dfs.`/data/json/s1/customer` c, dfs.`/data/json/s1/store_sales` s,
> dfs.`/data/json/s1/customer_address` a
> where c.c_customer_sk = s.ss_customer_sk and c.c_customer_id =
> a.ca_address_id;
> {code}
> The plan first builds on 'customer_address' and probes with 'customer', and
> the output projects all 6 columns (2 from 'a', 4 from 'c'). Then the second
> join builds on all those 6 columns from the first join, and probes from the
> large table 'store_sales', and finally all 8 columns are projected (see
> below). Then 3 project operators are used to remove the unneeded columns (see
> attached profile) - hence more waste.
> {code}
> public void projectBuildRecord(int buildIndex, int outIndex)
> throws SchemaChangeException
> {
> {
> vv3 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv0
> [((buildIndex)>>> 16)]);
> }
> {
> vv9 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv6
> [((buildIndex)>>> 16)]);
> }
> {
> vv15 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv12
> [((buildIndex)>>> 16)]);
> }
> {
> vv21 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv18
> [((buildIndex)>>> 16)]);
> }
> {
> vv27 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv24
> [((buildIndex)>>> 16)]);
> }
> {
> vv33 .copyFromSafe(((buildIndex)& 65535), (outIndex), vv30
> [((buildIndex)>>> 16)]);
> }
> }
> public void projectProbeRecord(int probeIndex, int outIndex)
> throws SchemaChangeException
> {
> {
> vv39 .copyFromSafe((probeIndex), (outIndex), vv36);
> }
> {
> vv45 .copyFromSafe((probeIndex), (outIndex), vv42);
> }
> }
> {code}
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)