GitHub user hvanhovell opened a pull request:
https://github.com/apache/spark/pull/12822
[SPARK-14785][SQL] Support correlated scalar subqueries
## What changes were proposed in this pull request?
In this PR we add support for correlated scalar subqueries. An example of
such a query is:
```SQL
select * from tbl1 a where a.value > (select max(value) from tbl2 b where
b.key = a.key)
```
The implementation adds the `RewriteCorrelatedScalarSubquery` rule to the
Optimizer. This rule plans these subqueries using `LEFT OUTER` joins. It
currently supports rewrites for `Project`, `Aggregate` & `Filter` logical plans.
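As a rough illustration of the join-based plan, the sketch below runs the example query and a hand-written `LEFT OUTER` join equivalent against an in-memory SQLite database and compares the results. SQLite only stands in for Spark SQL here. The hand-written join is not the output of `RewriteCorrelatedScalarSubquery`, and the sample rows, the alias `s` and the column `max_value` are assumptions made for the example.

```python
import sqlite3

# In-memory database with made-up sample data (not taken from the PR).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (key INTEGER, value INTEGER);
    CREATE TABLE tbl2 (key INTEGER, value INTEGER);
    INSERT INTO tbl1 VALUES (1, 10), (2, 5), (3, 7);
    INSERT INTO tbl2 VALUES (1, 4), (1, 8), (2, 9);
""")

# The correlated scalar subquery from the PR description.
correlated = """
    SELECT a.key, a.value
    FROM tbl1 a
    WHERE a.value > (SELECT max(value) FROM tbl2 b WHERE b.key = a.key)
    ORDER BY a.key
"""

# Hand-written equivalent: aggregate tbl2 per key, then LEFT OUTER join.
# Rows of tbl1 without a match get a NULL max_value and fail the filter,
# which matches the NULL returned by the subquery over an empty input.
rewritten = """
    SELECT a.key, a.value
    FROM tbl1 a
    LEFT OUTER JOIN (
        SELECT key, max(value) AS max_value FROM tbl2 GROUP BY key
    ) s ON s.key = a.key
    WHERE a.value > s.max_value
    ORDER BY a.key
"""

correlated_rows = conn.execute(correlated).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()
print(correlated_rows, rewritten_rows)
```

Both queries keep only the row with key 1, since 10 exceeds the maximum of 8, while key 2 fails the comparison and key 3 has no match in `tbl2`.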
I could not find well-defined semantics for the use of scalar subqueries
in an `Aggregate`. The current implementation evaluates the scalar
subquery *before* aggregation. This means that you either have to make the
scalar subquery part of the grouping expressions, or aggregate its result
further on. I am open to suggestions on this.
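To make the two options concrete, here is a minimal sketch of what "evaluated before aggregation" looks like once the subquery is written as a join by hand. It uses SQLite as a stand-in for Spark SQL. The sample rows and the names `s` and `max_value` are hypothetical, and the queries are illustrative shapes rather than the plans the optimizer produces.

```python
import sqlite3

# In-memory database with made-up sample data (not taken from the PR).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (key INTEGER, value INTEGER);
    CREATE TABLE tbl2 (key INTEGER, value INTEGER);
    INSERT INTO tbl1 VALUES (1, 10), (1, 3), (2, 5);
    INSERT INTO tbl2 VALUES (1, 8), (2, 9);
""")

# The join that stands in for the scalar subquery runs before the aggregate.
joined = """
    FROM tbl1 a
    LEFT OUTER JOIN (
        SELECT key, max(value) AS max_value FROM tbl2 GROUP BY key
    ) s ON s.key = a.key
"""

# Option 1: the subquery result is one of the grouping expressions.
as_grouping = conn.execute(
    "SELECT a.key, s.max_value, sum(a.value)" + joined +
    "GROUP BY a.key, s.max_value ORDER BY a.key"
).fetchall()

# Option 2: the subquery result is aggregated further on.
aggregated = conn.execute(
    "SELECT a.key, max(s.max_value), sum(a.value)" + joined +
    "GROUP BY a.key ORDER BY a.key"
).fetchall()

print(as_grouping, aggregated)
```

With this data the two forms agree because the subquery value is constant within each group. They can differ when the value is not determined by the other grouping expressions.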
The PR currently does not enforce that the subquery returns a unique result
for a given set of keys. This means that an ill-defined scalar subquery can
cause more rows to be produced than expected (as a `LEFT OUTER` join would).
This can be quite confusing in `Filter` clauses. We could restrict correlated
subqueries to aggregates, but I also feel that users should be given the
option to make that choice. I am again open to suggestions.
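The row-multiplication effect can be reproduced with a plain join. The sketch below uses SQLite as a stand-in for Spark SQL and writes the join by hand for a subquery that does not aggregate. The sample rows are assumptions, and the duplicate check at the end is just one way a user might verify uniqueness, not something the PR implements.

```python
import sqlite3

# In-memory database with made-up sample data (not taken from the PR).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (key INTEGER, value INTEGER);
    CREATE TABLE tbl2 (key INTEGER, value INTEGER);
    INSERT INTO tbl1 VALUES (1, 10);
    INSERT INTO tbl2 VALUES (1, 4), (1, 8);
""")

# Join form of a non-aggregated subquery such as
#   (SELECT value FROM tbl2 b WHERE b.key = a.key)
# tbl2 holds two rows for key 1, so the single tbl1 row is emitted twice.
join_rows = conn.execute("""
    SELECT a.key, a.value
    FROM tbl1 a
    LEFT OUTER JOIN tbl2 b ON b.key = a.key
    WHERE a.value > b.value
""").fetchall()

# A possible manual check: keys for which the subquery is not unique.
duplicate_keys = conn.execute("""
    SELECT key FROM tbl2 GROUP BY key HAVING count(*) > 1
""").fetchall()

print(join_rows, duplicate_keys)
```

Wrapping the subquery's output in an aggregate such as `max(value)` yields at most one row per key, which is the restriction discussed above.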
## How was this patch tested?
Added tests to `SubquerySuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hvanhovell/spark SPARK-14785
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12822.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12822
----
commit 18270752d3be461c8e14a337f00ebb8a40d6493f
Author: Herman van Hovell
Date: 2016-05-01T11:13:18Z
Add correlated scalar subqueries.
----