[GitHub] spark pull request: [SPARK-14785][SQL] Support correlated scalar s...

hvanhovell Sun, 01 May 2016 04:32:55 -0700

GitHub user hvanhovell opened a pull request:

    https://github.com/apache/spark/pull/12822


    [SPARK-14785][SQL] Support correlated scalar subqueries 

    ## What changes were proposed in this pull request?
    In this PR we add support for correlated scalar subqueries. An example of 
such a query is:
    ```SQL
    select * from tbl1 a
    where a.value > (select max(value) from tbl2 b where b.key = a.key)
    ```
    The implementation adds the `RewriteCorrelatedScalarSubquery` rule to the 
Optimizer. This rule plans these subqueries using `LEFT OUTER` joins. It 
currently supports rewrites for `Project`, `Aggregate` & `Filter` logical plans.
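
To make the rewrite concrete, here is a minimal sketch of the idea, not the actual `RewriteCorrelatedScalarSubquery` code. It assumes SQLite (through Python's `sqlite3` module) as a stand-in engine and uses made-up sample rows. The alias `max_value` and the exact shape of the joined subquery are illustrative assumptions. The sketch runs the example query in its correlated form and as a hand-written `LEFT OUTER` join, then compares the results.

```python
import sqlite3

# In-memory database with hypothetical sample data for tbl1 and tbl2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (key INTEGER, value INTEGER);
    CREATE TABLE tbl2 (key INTEGER, value INTEGER);
    INSERT INTO tbl1 VALUES (1, 10), (2, 5), (3, 7);
    INSERT INTO tbl2 VALUES (1, 4), (1, 8), (2, 9);
""")

# The correlated scalar subquery from the PR description.
correlated = """
    SELECT a.key, a.value FROM tbl1 a
    WHERE a.value > (SELECT max(value) FROM tbl2 b WHERE b.key = a.key)
    ORDER BY a.key
"""

# Hand-written equivalent: aggregate tbl2 per correlation key, then
# LEFT OUTER join it to the outer table (illustrative, not Spark's plan).
rewritten = """
    SELECT a.key, a.value FROM tbl1 a
    LEFT OUTER JOIN (
        SELECT key, max(value) AS max_value FROM tbl2 GROUP BY key
    ) b ON b.key = a.key
    WHERE a.value > b.max_value
    ORDER BY a.key
"""

correlated_rows = conn.execute(correlated).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()
print(correlated_rows, rewritten_rows)
```

Key 3 has no match in `tbl2`, so the subquery yields NULL in both forms and the comparison filters that row out. An outer join keeps this behaviour where an inner join would only do so by coincidence.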
    
    I could not find well-defined semantics for the use of scalar subqueries 
in an `Aggregate`. The current implementation evaluates the scalar 
subquery *before* aggregation. This means that you either have to make the 
scalar subquery part of the grouping expressions, or you have to aggregate it 
further on. I am open to suggestions on this.
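
The two options can be sketched on the joined form of the query. This is only an illustration under assumptions: SQLite stands in for Spark, the sample rows are invented, and `max_value` is a hypothetical alias for the per-key subquery result. Because the subquery value is attached to each row before aggregation, it must either appear in the `GROUP BY` or be wrapped in an aggregate itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (key INTEGER, value INTEGER);
    CREATE TABLE tbl2 (key INTEGER, value INTEGER);
    INSERT INTO tbl1 VALUES (1, 10), (1, 11), (2, 5);
    INSERT INTO tbl2 VALUES (1, 4), (1, 8), (2, 9);
""")

# Shared FROM clause: the scalar subquery expressed as a join that is
# evaluated before the aggregate (illustrative shape only).
joined = """
    FROM tbl1 a
    LEFT OUTER JOIN (
        SELECT key, max(value) AS max_value FROM tbl2 GROUP BY key
    ) b ON b.key = a.key
"""

# Option 1: make the subquery result part of the grouping expressions.
grouped_rows = conn.execute(
    "SELECT a.key, b.max_value, count(*)" + joined +
    "GROUP BY a.key, b.max_value ORDER BY a.key"
).fetchall()

# Option 2: aggregate the subquery result further on.
aggregated_rows = conn.execute(
    "SELECT a.key, max(b.max_value), count(*)" + joined +
    "GROUP BY a.key ORDER BY a.key"
).fetchall()

print(grouped_rows, aggregated_rows)
```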
    
    The PR currently does not enforce that the subquery returns a unique 
result for each combination of correlation keys. This means that an 
ill-defined scalar subquery can produce more rows than expected (as a 
`LEFT OUTER` join would). This can be quite confusing in `Filter` clauses. We 
could restrict correlated subqueries to aggregates, but I also feel that the 
user should be given the option to make that choice. Again, I am open to 
suggestions.
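
The row duplication is easy to reproduce outside Spark. The following is a minimal sketch that assumes SQLite as a stand-in and invented sample data. It joins directly against a non-aggregated `tbl2`, which is roughly what a join-based plan sees when the subquery is not unique per key, and compares that with the aggregated variant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl1 (key INTEGER, value INTEGER);
    CREATE TABLE tbl2 (key INTEGER, value INTEGER);
    INSERT INTO tbl1 VALUES (1, 10);
    INSERT INTO tbl2 VALUES (1, 4), (1, 8);
""")

# Ill-defined case: tbl2 has two rows for key 1 and no aggregate collapses
# them, so the single tbl1 row is emitted once per matching tbl2 row.
duplicated_rows = conn.execute("""
    SELECT a.key, a.value, b.value FROM tbl1 a
    LEFT OUTER JOIN tbl2 b ON b.key = a.key
    WHERE a.value > b.value
    ORDER BY b.value
""").fetchall()

# Well-defined case: aggregating per key guarantees at most one match.
unique_rows = conn.execute("""
    SELECT a.key, a.value, b.max_value FROM tbl1 a
    LEFT OUTER JOIN (
        SELECT key, max(value) AS max_value FROM tbl2 GROUP BY key
    ) b ON b.key = a.key
    WHERE a.value > b.max_value
""").fetchall()

print(len(duplicated_rows), len(unique_rows))
```

A filter over `tbl1` would normally never return more rows than `tbl1` contains, which is why the first result is surprising in a `Filter` clause.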
    
    ## How was this patch tested?
    Added tests to `SubquerySuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hvanhovell/spark SPARK-14785

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12822.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12822
    
----
commit 18270752d3be461c8e14a337f00ebb8a40d6493f
Author: Herman van Hovell <[email protected]>
Date:   2016-05-01T11:13:18Z

    Add correlated scalar subqueries.

----


