GitHub user ankurdave opened a pull request:
https://github.com/apache/spark/pull/9089
[SPARK-11077] [SQL] Join elimination in Catalyst
Join elimination is a query optimization that removes a join when it is
followed by a projection that keeps columns from only one side of the join,
and when certain columns are known to be unique keys or foreign keys. This is
especially useful for queries over views and for machine-generated queries.
This PR adds join elimination by (1) supporting unique and foreign key
hints in logical plans, (2) adding methods in the DataFrame API to let users
provide these hints, and (3) adding an optimizer rule that eliminates unique
key outer joins and referential integrity joins when followed by an appropriate
projection.
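To make the two cases above concrete, here is a minimal sketch in plain Python rather than Catalyst. It does not use any Spark API or the hint methods this PR adds; the relations, column names, and the `join` and `project` helpers are all hypothetical and exist only for illustration. It checks that a left outer join on a unique key, and an inner join on a non-null foreign key that references a unique key, both return the same rows as the left side alone once only left-side columns are projected.

```python
def join(left, right, left_key, right_key, how="inner"):
    """Naive nested-loop join over lists of dicts ('inner' or 'left_outer')."""
    right_cols = list(right[0].keys()) if right else []
    out = []
    for l in left:
        # SQL semantics: a NULL (None) key never matches.
        matches = [r for r in right
                   if l[left_key] is not None and r[right_key] == l[left_key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        elif how == "left_outer":
            # Unmatched left rows are kept, padded with NULLs on the right.
            out.append({**l, **{c: None for c in right_cols}})
    return out


def project(rows, cols):
    """Keep only the named columns, preserving row order."""
    return [tuple(row[c] for c in cols) for row in rows]


# Hypothetical data: departments.id is unique.
departments = [{"id": 1, "dept_name": "eng"}, {"id": 2, "dept_name": "ops"}]
employees = [
    {"emp_id": 10, "name": "a", "dept_id": 1},
    {"emp_id": 11, "name": "b", "dept_id": 2},
    {"emp_id": 12, "name": "c", "dept_id": 1},
]
# Same employees plus one whose dept_id matches no department.
employees_dangling = employees + [{"emp_id": 13, "name": "d", "dept_id": 99}]
left_cols = ["emp_id", "name"]

# Case 1: unique-key left outer join. Every left row appears exactly once,
# whether or not it finds a match, so the join can be dropped.
outer_join_is_redundant = (
    project(join(employees_dangling, departments, "dept_id", "id", "left_outer"),
            left_cols)
    == project(employees_dangling, left_cols)
)

# Case 2: referential-integrity inner join. Every dept_id is non-null and
# refers to exactly one department, so no left row is lost or duplicated.
inner_join_is_redundant = (
    project(join(employees, departments, "dept_id", "id", "inner"), left_cols)
    == project(employees, left_cols)
)

# Counterexample: if the join key is not unique on the right, the join
# duplicates left rows and must not be eliminated.
departments_dup = departments + [{"id": 1, "dept_name": "eng2"}]
non_unique_join_is_redundant = (
    project(join(employees, departments_dup, "dept_id", "id", "left_outer"),
            left_cols)
    == project(employees, left_cols)
)
```

In this toy model the first two flags come out true and the third false, which is why the uniqueness hint is needed before the optimizer can safely drop the join.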
This change is described in detail here:
https://docs.google.com/document/d/1-YgQSQywHfAo4PhAT-zOOkFZtVcju99h3dYQq-i9GWQ/edit?usp=sharing
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ankurdave/spark SPARK-11077-JoinElimination
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9089.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9089
----
commit 4f528770ecf4a2ae780d6514fdc8c5e7cf899288
Author: Ankur Dave <[email protected]>
Date: 2015-08-04T05:33:59Z
Eliminate outer join before project
commit ae46ab0891e974f6491d4b266f08d95d7a1c1382
Author: Ankur Dave <[email protected]>
Date: 2015-08-12T20:15:50Z
Use KeyHint to do join elimination
commit df9ef1421cee2f8f94dac24a8116ad504a009a20
Author: Ankur Dave <[email protected]>
Date: 2015-08-12T23:25:30Z
Add foreign keys
commit b22f7025860fed1b3f7bd5147691f5ef887bca01
Author: Ankur Dave <[email protected]>
Date: 2015-08-13T02:49:26Z
Alias-aware join elimination + bugfixes
commit 9072cb70872b156027cb2e673a397cc01f326128
Author: Ankur Dave <[email protected]>
Date: 2015-08-13T03:22:55Z
Propagate foreign keys through Join operator
commit f430ea2c6413879403973fc4fdd4217dde9d27ec
Author: Ankur Dave <[email protected]>
Date: 2015-08-13T03:43:06Z
Remove key hints after join elimination
commit 130253101f2db627c42ea4f8759dfeef6c62e574
Author: Ankur Dave <[email protected]>
Date: 2015-08-17T01:55:36Z
Support inner joins based on referential integrity
commit 35949f54c53357a86e0a2e2aeb0e5524a8285ce5
Author: Ankur Dave <[email protected]>
Date: 2015-08-18T06:38:30Z
Correctness fixes for join elimination
Do not eliminate referential integrity full outer joins, or inner joins
where the foreign key is nullable. Require foreign keys to reference unique
columns.
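A small sketch of why these two restrictions matter, again in plain Python with hypothetical relations and helpers rather than the PR's actual code: an inner join drops rows whose foreign key is NULL, and a full outer join adds rows for unreferenced keys, so in both cases the projected result differs from the left side alone.

```python
def inner_join(left, right, lk, rk):
    # SQL semantics: a NULL (None) key never matches anything.
    return [{**l, **r} for l in left for r in right
            if l[lk] is not None and l[lk] == r[rk]]


def full_outer_join(left, right, lk, rk):
    left_cols, right_cols = list(left[0].keys()), list(right[0].keys())
    out, matched_right = [], set()
    for l in left:
        hits = [i for i, r in enumerate(right)
                if l[lk] is not None and l[lk] == r[rk]]
        matched_right.update(hits)
        if hits:
            out.extend({**l, **right[i]} for i in hits)
        else:
            out.append({**l, **dict.fromkeys(right_cols)})
    # Right rows that nothing referenced are kept, padded with NULLs on the left.
    for i, r in enumerate(right):
        if i not in matched_right:
            out.append({**dict.fromkeys(left_cols), **r})
    return out


# Hypothetical data: customers.cust_id is unique; orders.customer_id refers to it.
customers = [{"cust_id": 1, "cust_name": "x"}, {"cust_id": 2, "cust_name": "y"}]
orders_nullable = [{"order_id": 100, "customer_id": 1},
                   {"order_id": 101, "customer_id": None}]
orders_non_null = [{"order_id": 100, "customer_id": 1}]

# Nullable foreign key: the inner join silently drops order 101.
inner_ids = [row["order_id"]
             for row in inner_join(orders_nullable, customers,
                                   "customer_id", "cust_id")]

# Full outer join: customer 2 has no orders, so an extra all-NULL left row appears.
full_ids = [row["order_id"]
            for row in full_outer_join(orders_non_null, customers,
                                       "customer_id", "cust_id")]
```

Here `inner_ids` has fewer entries than `orders_nullable`, and `full_ids` has more entries than `orders_non_null`, so neither join could be replaced by a scan of the orders relation.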
commit 945e5231e900621c4a2bbf103816385d68abd5e0
Author: Ankur Dave <[email protected]>
Date: 2015-08-19T06:15:31Z
Do key hint resolution during analysis
This is necessary to support aliased self joins and multiple foreign keys
with the same referent.
commit 504c9d858b8b35ed788e31bf99fc5f6506be792d
Author: Ankur Dave <[email protected]>
Date: 2015-08-19T06:18:02Z
Don't crash when foreign key refers to unresolved relation
Instead just leave the KeyHint unresolved.
commit 83c8ff913dc06f79ce059906e62b0e744967c1e4
Author: Ankur Dave <[email protected]>
Date: 2015-08-19T07:42:04Z
Fix JoinEliminationSuite
commit 0b0b8401f97bf52dabacfa818fa62a4477ca4c72
Author: Ankur Dave <[email protected]>
Date: 2015-08-19T11:01:43Z
Merge remote-tracking branch 'apache-spark/master' into GraphFrames
commit 9150ddaf2d598314ff3ea1fe4a434de37325d213
Author: Ankur Dave <[email protected]>
Date: 2015-08-19T12:14:53Z
Fix KeyHintSuite after merge
commit 873b3224b043875718959c645146743ed78084da
Author: Ankur Dave <[email protected]>
Date: 2015-10-13T01:47:47Z
In ForeignKey, store referencedRelation as logical plan
Previously we stored its name as part of referencedAttr, requiring a
catalog lookup.
commit 98e0b5e316b1692a188dedc6b49daaa5854a064b
Author: Ankur Dave <[email protected]>
Date: 2015-10-13T02:45:21Z
Use semanticEquals for Attributes
commit d43a2c005b091e571a9d5dc3cc7d22e22a29ffd0
Author: Ankur Dave <[email protected]>
Date: 2015-10-13T03:37:35Z
Remove TODOs
commit f4e7e0140865df27f3c0b000f22d69117316070e
Author: Ankur Dave <[email protected]>
Date: 2015-10-13T04:02:02Z
Add more comments
commit 49b196e041c80c83eef0b069c984e608cc6433b5
Author: Ankur Dave <[email protected]>
Date: 2015-10-13T04:13:46Z
Merge remote-tracking branch 'apache-spark/master' into GraphFrames
commit 578797c456e20d0fb07bf10cb3e64f09065948f9
Author: Ankur Dave <[email protected]>
Date: 2015-10-13T04:38:46Z
Use SharedSQLContext in KeyHintSuite
----