GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/17929

    [SPARK-20686][SQL] PropagateEmptyRelation incorrectly handles aggregate 
without grouping

    ## What changes were proposed in this pull request?
    
    The query
    
    ```
    SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
    ```
    
    should return a single row of output because the subquery is an aggregate 
without a group-by and thus should return a single row. However, Spark 
incorrectly returns zero rows.
    
    This is caused by SPARK-16208 / #13906, a patch which added an optimizer 
rule to propagate EmptyRelation through operators. The logic for handling 
aggregates is wrong: it checks whether aggregate expressions are non-empty for 
deciding whether the output should be empty, whereas it should be checking 
grouping expressions instead:
    
    An aggregate with non-empty grouping expression will return one output row 
per group. If the input to the grouped aggregate is empty then all groups will 
be empty and thus the output will be empty. It doesn't matter whether the 
aggregation output columns include aggregate expressions since that won't 
affect the number of output rows.
    
    If the grouping expressions are empty, however, then the aggregate will 
always produce a single output row and thus we cannot propagate the 
EmptyRelation.
    
    The current implementation is incorrect and also misses an optimization 
opportunity by not propagating EmptyRelation in the case where a grouped 
aggregate has aggregate expressions (in other words, `SELECT COUNT(*) from 
emptyRelation GROUP BY x` would _not_ be optimized to `EmptyRelation` in the 
old code, even though it safely could be).
    
    This patch resolves this issue by modifying `PropagateEmptyRelation` to 
consider only the presence/absence of grouping expressions, not the aggregate 
functions themselves, when deciding whether to propagate EmptyRelation.
    
    ## How was this patch tested?
    
    - Added end-to-end regression tests in `SQLQueryTest`'s `group-by.sql` file.
    - Updated unit tests in `PropagateEmptyRelationSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark fix-PropagateEmptyRelation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17929.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17929
    
----
commit 8b8ef4be23945f51eb6644149f8786a7b95d06b0
Author: Josh Rosen <[email protected]>
Date:   2017-05-10T00:30:06Z

    Add failing SQLQueryTest regression test.

commit 37b0c1bda88a51487fa80accc66a6d38118ebe93
Author: Josh Rosen <[email protected]>
Date:   2017-05-10T00:38:49Z

    Update optimizer tests to also fail.

commit 7a921bb46164762710cc58a20952486aa7556b59
Author: Josh Rosen <[email protected]>
Date:   2017-05-10T00:39:23Z

    Implement fix.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to