GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/7773
[SPARK-2205] [SQL] [WIP] Avoid unnecessary exchange operators in multi-way
joins
This PR adds `PartitioningCollection`, which is used to represent the
`outputPartitioning` for SparkPlans with multiple children (e.g.
`ShuffledHashJoin`). So, a `SparkPlan` can have multiple descriptions of its
partitioning schemes. Taking `ShuffledHashJoin` as an example, it has two
descriptions of its partitioning schemes, i.e. `left.outputPartitioning` and
`right.outputPartitioning`. As a result, a query like `select * from t1 join
t2 on (t1.x = t2.x) join t3 on (t2.x = t3.x)` will have only three Exchange
operators (when shuffled joins are needed) instead of four.
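To illustrate the exchange counting described above, here is a minimal toy model in Python. It is not the Spark implementation: the classes only borrow the names `HashPartitioning` and `PartitioningCollection`, while `plan_join` and `count_exchanges` are hypothetical helpers. It also assumes that, without the collection, a join reports only its left child's partitioning.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class HashPartitioning:
    """Toy model: rows are hash-partitioned on exactly these columns."""
    columns: Tuple[str, ...]

    def satisfies(self, required: Tuple[str, ...]) -> bool:
        return self.columns == required


@dataclass(frozen=True)
class PartitioningCollection:
    """Toy model: several partitioning descriptions that all hold at once."""
    partitionings: Tuple[object, ...]

    def satisfies(self, required: Tuple[str, ...]) -> bool:
        # The collection satisfies a requirement if any member does.
        return any(p.satisfies(required) for p in self.partitionings)


def plan_join(left, right, left_keys, right_keys, use_collection):
    """Hypothetical helper: return (output partitioning, exchanges added).

    `left` and `right` are the children's partitionings; None means unknown,
    as for a plain table scan.
    """
    exchanges = 0
    if left is None or not left.satisfies(left_keys):
        left = HashPartitioning(left_keys)  # an Exchange would be inserted here
        exchanges += 1
    if right is None or not right.satisfies(right_keys):
        right = HashPartitioning(right_keys)  # an Exchange would be inserted here
        exchanges += 1
    # Assumption: without the collection, only the left side is reported.
    output = PartitioningCollection((left, right)) if use_collection else left
    return output, exchanges


def count_exchanges(use_collection: bool) -> int:
    """Count exchanges for: t1 join t2 on (t1.x = t2.x) join t3 on (t2.x = t3.x)."""
    first, n1 = plan_join(None, None, ("t1.x",), ("t2.x",), use_collection)
    _, n2 = plan_join(first, None, ("t2.x",), ("t3.x",), use_collection)
    return n1 + n2


print(count_exchanges(use_collection=True))   # 3
print(count_exchanges(use_collection=False))  # 4
```

In this sketch the second join requires its left input to be partitioned on `t2.x`. The first join's output meets that requirement only when it also carries the right child's partitioning, which is what saves the fourth exchange.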
The code in this PR was authored by @yhuai; I'm opening this PR to factor
out this change from #7685, a larger pull request which contains two other
optimizations.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark multi-way-join-planning-improvements
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7773.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7773
----
commit 220112906737b3db668513a024423b35a2c2f32a
Author: Yin Huai <[email protected]>
Date: 2015-07-23T19:21:08Z
Filter out rows that will not be joined in equal joins early.
commit d5b84c399c6966ad509276b3f146948ff06e5ca4
Author: Yin Huai <[email protected]>
Date: 2015-07-24T01:48:47Z
Do not add unnecessary filters.
commit 69bb0724eb1dd92d20afdde4b607d37bc4d5e4ca
Author: Yin Huai <[email protected]>
Date: 2015-07-24T01:49:34Z
Introduce NullSafeHashPartitioning and NullUnsafePartitioning.
commit 7c2d2d87a7182fbc9fc8b35fd75db64e147f0ff7
Author: Yin Huai <[email protected]>
Date: 2015-07-26T22:51:38Z
Bug fix and refactoring.
commit e616d3b0a2fa5836956c15b9f64410683a3ef9db
Author: Yin Huai <[email protected]>
Date: 2015-07-27T03:28:49Z
wip
commit c6667e745b0ce0c24dccd419d8fea10e21d24290
Author: Yin Huai <[email protected]>
Date: 2015-07-27T05:03:46Z
Add PartitioningCollection.
commit f9516b0687a90713f2b401d49418ec8ee081f457
Author: Yin Huai <[email protected]>
Date: 2015-07-27T05:29:48Z
Style
commit d3d2e646d525cc9c6e425ae99020d26bbaab10dc
Author: Yin Huai <[email protected]>
Date: 2015-07-27T21:14:34Z
First round of cleanup.
commit c57a95465a2410fa515d6bbcf3dd0276a19f1d21
Author: Yin Huai <[email protected]>
Date: 2015-07-27T23:39:49Z
Bug fix.
commit 247e5fa980bc1440596c7ab5a4fdcfaf204351da
Author: Josh Rosen <[email protected]>
Date: 2015-07-30T02:29:55Z
Merge remote-tracking branch 'origin/master' into
multi-way-join-planning-improvements
commit 884ab953cbec972c87b6fc9b6dcc966632c01dee
Author: Josh Rosen <[email protected]>
Date: 2015-07-30T02:51:01Z
Carve out only SPARK-2205 changes.
----