GitHub user chenghao-intel opened a pull request:

    https://github.com/apache/spark/pull/5326

    [SPARK-3862] [SQL] [WIP] MultiWayBroadcastJoin for LeftSemi & Inner JOIN

    Assume we have tables `x`, `y`, and `z`, where `x` is the fact table with a
    large amount of data, and `y` and `z` are dimension tables.
    
    ```sql
    SELECT x.a, y.a, z.a
    FROM x
    JOIN y ON x.a = y.a AND y.a < 3
    JOIN z ON x.a = z.a AND z.a > 1
    ```
    
    Executed as a chain of binary joins, this query has to read and write the
    fact table's data (the large table) multiple times. This PR (multi-way
    broadcast join) reads all of the data only once and applies the filters of
    all the joins in that same pass, which should reduce the IO overhead
    significantly.
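
    As an illustration only (this is not the code in this PR), the single-pass
idea can be sketched in plain Python. The names `build_hash` and
`multiway_broadcast_join` are hypothetical, rows are modelled as dicts, and the
dimension tables are assumed to be small enough to hold in memory as hash
tables.

```python
from collections import defaultdict
from itertools import product


def build_hash(rows, key, keep):
    """Hash a small (broadcast) table on `key`, keeping rows that pass `keep`."""
    table = defaultdict(list)
    for row in rows:
        if keep(row):
            table[row[key]].append(row)
    return dict(table)


def multiway_broadcast_join(fact_rows, key, dim_tables):
    """Inner-join the fact rows against every dimension hash table in one scan."""
    for fact in fact_rows:  # the fact table is read exactly once
        matches = []
        for table in dim_tables:
            hit = table.get(fact[key])
            if not hit:  # a single miss drops the fact row early
                break
            matches.append(hit)
        else:
            # Every dimension matched: emit one output row per combination.
            for combo in product(*matches):
                yield (fact, *combo)


# Toy data for the example query: x is the "fact" table, y and z are dimensions.
x = [{"a": v} for v in range(5)]
y = [{"a": v} for v in range(5)]
z = [{"a": v} for v in range(5)]

y_hash = build_hash(y, "a", lambda r: r["a"] < 3)  # y.a < 3
z_hash = build_hash(z, "a", lambda r: r["a"] > 1)  # z.a > 1

result = [
    (f["a"], r1["a"], r2["a"])
    for f, r1, r2 in multiway_broadcast_join(x, "a", [y_hash, z_hash])
]
print(result)  # [(2, 2, 2)]
```

    A chain of binary joins would instead materialize the result of `x JOIN y`
before probing `z`; in the sketch a fact row is discarded at the first
dimension that has no match, so no intermediate result is written.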
    
    This PR is posted for early feedback. Some TODOs remain, listed below, but
    they can probably be done in separate PRs:
    - Multi-way join for JOINs with identical equi-join keys.
    - Join reordering.
    - Integration with Sort-Merge-Join in the multi-way JOIN.
    - Code cleanup: unify the JOIN code by replacing the binary join with the
      multi-way join.
    
    Restrictions
    - The fact table must be the left-most table in the query; `Join Reordering`
      can lift this restriction later.
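
    A minimal sketch of what such a reordering step could look like, assuming
per-table row-count estimates are available and that all joins are inner joins
(so the order can be changed freely). The function `put_fact_table_first` and
the statistics dict are hypothetical and not part of this PR.

```python
def put_fact_table_first(tables, estimated_rows):
    """Move the largest table (by estimated row count) to the left-most position.

    The remaining tables keep their original relative order.
    """
    fact = max(tables, key=lambda t: estimated_rows[t])
    return [fact] + [t for t in tables if t != fact]


# Hypothetical size estimates: x is the fact table, y and z are dimensions.
order = put_fact_table_first(["y", "x", "z"], {"x": 1_000_000, "y": 100, "z": 50})
print(order)  # ['x', 'y', 'z']
```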
    
    Benchmarking results will be provided soon...

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chenghao-intel/spark dim_join

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5326.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5326
    
----
commit 84d15d50693fbea35c11963484ef8cd798e7bd55
Author: Cheng Hao <[email protected]>
Date:   2015-03-26T03:01:21Z

    minor changes

commit 645b9bee819501e7aec8d2ae1b29812a857d9fde
Author: Cheng Hao <[email protected]>
Date:   2015-03-26T04:17:57Z

    update the code of empty check in HashedRelation related code

commit 90fa2858351d6e774a48f3502bd58f6eafa96dad
Author: Cheng Hao <[email protected]>
Date:   2015-03-25T08:15:01Z

    Add multiple row & multi-way join support

commit aa4bab2530e64fd0b51001de10176b7fa182e222
Author: Cheng Hao <[email protected]>
Date:   2015-04-01T06:17:57Z

    WIP broadcast join

commit b4cbabdb541192ea0e8864627ea8d3b25523e5b3
Author: Cheng Hao <[email protected]>
Date:   2015-04-02T05:48:13Z

    star schema

----


