GitHub user chenghao-intel opened a pull request:
https://github.com/apache/spark/pull/5326
[SPARK-3862] [SQL] [WIP] MultiWayBroadcastJoin for LeftSemi & Inner JOIN
Assume we have tables `x`, `y`, and `z`, where `x` is the fact table with a
large amount of data, and `y` and `z` are dimension tables.
```sql
SELECT x.a, y.a, z.a
FROM x
JOIN y ON x.a = y.a AND y.a < 3
JOIN z ON x.a = z.a AND z.a > 1
```
Computing the result as a sequence of binary joins requires reading and
writing the fact table's data (a large amount) multiple times. This PR
(multi-way broadcast join) reduces that IO overhead significantly by reading
all of the data once and applying the filtering effect of all the join filters
in the same pass.
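As a rough illustration of the single-pass idea described above (not the code in this PR), here is a minimal Python sketch. It assumes each dimension table is small enough to be hashed in memory, and that the extra conditions in the `ON` clauses (`y.a < 3`, `z.a > 1`) can be applied while building the hash tables, which holds for inner joins. The names `build_hash_table` and `multiway_broadcast_join` are hypothetical and exist only for this example.

```python
from collections import defaultdict
from itertools import product


def build_hash_table(rows, key, predicate):
    """Hash one small dimension table by its join key, keeping only rows that pass its filter."""
    table = defaultdict(list)
    for row in rows:
        if predicate(row):
            table[row[key]].append(row)
    return table


def multiway_broadcast_join(fact_rows, dims):
    """Inner-join the fact table against every dimension table in one scan.

    `dims` is a list of (fact_key, hash_table) pairs, one per dimension table.
    """
    results = []
    for fact_row in fact_rows:              # the fact table is read exactly once
        matches = []
        for fact_key, table in dims:
            hit = table.get(fact_row[fact_key])
            if not hit:                      # a miss in any dimension drops the row
                break
            matches.append(hit)
        else:
            # A key may match several rows per dimension: emit every combination.
            for combo in product(*matches):
                results.append((fact_row, *combo))
    return results


# Toy data standing in for the tables in the query above.
x = [{"a": 1}, {"a": 2}, {"a": 3}, {"a": 4}]
y = [{"a": 1}, {"a": 2}, {"a": 3}]
z = [{"a": 2}, {"a": 3}]

dims = [
    ("a", build_hash_table(y, "a", lambda r: r["a"] < 3)),
    ("a", build_hash_table(z, "a", lambda r: r["a"] > 1)),
]
rows = [(xr["a"], yr["a"], zr["a"]) for xr, yr, zr in multiway_broadcast_join(x, dims)]
print(rows)  # [(2, 2, 2)]
```

With binary joins, the intermediate result of `x JOIN y` would be materialized and scanned again to join with `z`; in the sketch, each row of `x` is probed against both hash tables before anything is emitted.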
This PR is up for early feedback. Some TODOs are listed below, but they can
probably be done in other PRs:
- Multiway join for JOINs with identical equi-join keys.
- Join Reordering.
- Integration with Sort-Merge-Join in Multiway JOIN.
- Code cleanup: unify the JOIN code by removing the binary join (replaced
with the multi-way join).
Restrictions
- The fact table must be the left-most table; we can improve that in
`Join Reordering`.
Benchmarking results will be provided soon...
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/chenghao-intel/spark dim_join
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5326.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5326
----
commit 84d15d50693fbea35c11963484ef8cd798e7bd55
Author: Cheng Hao <[email protected]>
Date: 2015-03-26T03:01:21Z
minor changes
commit 645b9bee819501e7aec8d2ae1b29812a857d9fde
Author: Cheng Hao <[email protected]>
Date: 2015-03-26T04:17:57Z
update the code of empty check in HashedRelation related code
commit 90fa2858351d6e774a48f3502bd58f6eafa96dad
Author: Cheng Hao <[email protected]>
Date: 2015-03-25T08:15:01Z
Add multiple row & multi-way join support
commit aa4bab2530e64fd0b51001de10176b7fa182e222
Author: Cheng Hao <[email protected]>
Date: 2015-04-01T06:17:57Z
WIP broadcast join
commit b4cbabdb541192ea0e8864627ea8d3b25523e5b3
Author: Cheng Hao <[email protected]>
Date: 2015-04-02T05:48:13Z
star schema
----