GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/15238
[SPARK-17653][SQL] Remove unnecessary distincts in multiple unions
## What changes were proposed in this pull request?
Currently, `Union [Distinct]` is planned as a `Distinct` operator on top of
a `Union`. When several `Union [Distinct]` are adjacent, each one adds its own
`Distinct`, so the query plan ends up with multiple `Distinct` operators.
For example, take a query like:

    select 1 a union select 2 b union select 3 c

Before this patch, its physical plan looks like:

    *HashAggregate(keys=[a#13], functions=[])
    +- Exchange hashpartitioning(a#13, 200)
       +- *HashAggregate(keys=[a#13], functions=[])
          +- Union
             :- *HashAggregate(keys=[a#13], functions=[])
             :  +- Exchange hashpartitioning(a#13, 200)
             :     +- *HashAggregate(keys=[a#13], functions=[])
             :        +- Union
             :           :- *Project [1 AS a#13]
             :           :  +- Scan OneRowRelation[]
             :           +- *Project [2 AS b#14]
             :              +- Scan OneRowRelation[]
             +- *Project [3 AS c#15]
                +- Scan OneRowRelation[]

Only the topmost distinct is necessary, because it already removes every
duplicate that the inner distinct would remove.
After this patch, the physical plan looks like:

    *HashAggregate(keys=[a#221], functions=[], output=[a#221])
    +- Exchange hashpartitioning(a#221, 5)
       +- *HashAggregate(keys=[a#221], functions=[], output=[a#221])
          +- Union
             :- *Project [1 AS a#221]
             :  +- Scan OneRowRelation[]
             :- *Project [2 AS b#222]
             :  +- Scan OneRowRelation[]
             +- *Project [3 AS c#223]
                +- Scan OneRowRelation[]

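
The rewrite described above can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions: the `Relation`, `Union`, and `Distinct` classes and the `flatten_distinct_union` function are hypothetical stand-ins written for this example, not Spark's Catalyst classes or the rule added by this patch. It shows that collapsing nested distinct unions into a single `Distinct` over one flat `Union` leaves the result unchanged.

```python
from dataclasses import dataclass
from typing import List, Tuple


class Plan:
    """Base class for the toy logical plan nodes (hypothetical, not Catalyst)."""


@dataclass(frozen=True)
class Relation(Plan):
    rows: Tuple[int, ...]


@dataclass(frozen=True)
class Union(Plan):
    children: Tuple[Plan, ...]


@dataclass(frozen=True)
class Distinct(Plan):
    child: Plan


def flatten_distinct_union(plan: Plan) -> Plan:
    """Collapse Distinct(Union(Distinct(Union(...)), ...)) into one Distinct over a flat Union."""
    if isinstance(plan, Distinct) and isinstance(plan.child, Union):
        flat: List[Plan] = []
        stack = list(reversed(plan.child.children))
        while stack:
            node = stack.pop()
            if isinstance(node, Distinct) and isinstance(node.child, Union):
                # The outer Distinct removes duplicates anyway, so this inner one is redundant.
                stack.extend(reversed(node.child.children))
            elif isinstance(node, Union):
                # A plain nested Union can be merged into the flat child list as well.
                stack.extend(reversed(node.children))
            else:
                flat.append(flatten_distinct_union(node))
        return Distinct(Union(tuple(flat)))
    if isinstance(plan, Distinct):
        return Distinct(flatten_distinct_union(plan.child))
    if isinstance(plan, Union):
        return Union(tuple(flatten_distinct_union(c) for c in plan.children))
    return plan


def evaluate(plan: Plan) -> List[int]:
    """Evaluate the toy plan to a list of rows."""
    if isinstance(plan, Relation):
        return list(plan.rows)
    if isinstance(plan, Union):
        return [row for child in plan.children for row in evaluate(child)]
    if isinstance(plan, Distinct):
        # dict.fromkeys drops duplicates while keeping first-occurrence order.
        return list(dict.fromkeys(evaluate(plan.child)))
    raise TypeError(f"unknown plan node: {plan!r}")


def count_distincts(plan: Plan) -> int:
    """Count the Distinct operators in the plan."""
    if isinstance(plan, Distinct):
        return 1 + count_distincts(plan.child)
    if isinstance(plan, Union):
        return sum(count_distincts(c) for c in plan.children)
    return 0


# Toy version of: select 1 a union select 2 b union select 3 c
before = Distinct(Union((
    Distinct(Union((Relation((1,)), Relation((2,))))),
    Relation((3,)),
)))
after = flatten_distinct_union(before)

print(count_distincts(before), count_distincts(after))
print(sorted(evaluate(before)) == sorted(evaluate(after)))
```

In this toy model the plan before the rewrite contains two `Distinct` nodes and the plan after contains one, while both evaluate to the same set of rows.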
## How was this patch tested?
Jenkins tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 remove-extra-distinct-union
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15238.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15238
----
commit c770a9a9948c301a831daa555360702c73542aa2
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-09-26T03:37:46Z
Remove unnecessary distincts in multiple unions.
----