GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/17929
[SPARK-20686][SQL] PropagateEmptyRelation incorrectly handles aggregate
without grouping
## What changes were proposed in this pull request?
The query
```
SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
```
should return a single row of output because the subquery is an aggregate
without a group-by and thus should return a single row. However, Spark
incorrectly returns zero rows.
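The semantics in question are standard SQL, not Spark-specific, so they can be illustrated with any SQL engine. A minimal sketch using Python's built-in sqlite3 (the table name `t` is hypothetical, chosen only for the example):

```python
import sqlite3

# Empty table over which we run grouped and ungrouped aggregates.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (x INTEGER)")  # t has zero rows

# Ungrouped aggregate over empty input: exactly one output row.
ungrouped = cur.execute("SELECT COUNT(*) FROM t").fetchall()

# Grouped aggregate over empty input: zero output rows (no groups exist).
grouped = cur.execute("SELECT COUNT(*) FROM t GROUP BY x").fetchall()

print(ungrouped)  # [(0,)]
print(grouped)    # []
```

The subquery in the bug report is the first case, so the outer query should see one row; Spark's optimizer was treating it like the second case.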
This is caused by SPARK-16208 / #13906, a patch which added an optimizer
rule to propagate EmptyRelation through operators. Its logic for handling
aggregates is wrong: it checks whether the aggregate expressions are non-empty
to decide whether the output should be empty, when it should instead be
checking the grouping expressions:
An aggregate with non-empty grouping expressions will return one output row
per group. If the input to the grouped aggregate is empty then all groups will
be empty and thus the output will be empty. It doesn't matter whether the
aggregation output columns include aggregate expressions since that won't
affect the number of output rows.
If the grouping expressions are empty, however, then the aggregate will
always produce a single output row and thus we cannot propagate the
EmptyRelation.
The current implementation is incorrect and also misses an optimization
opportunity by not propagating EmptyRelation in the case where a grouped
aggregate has aggregate expressions (in other words, `SELECT COUNT(*) FROM
emptyRelation GROUP BY x` would _not_ be optimized to `EmptyRelation` in the
old code, even though it safely could be).
This patch resolves this issue by modifying `PropagateEmptyRelation` to
consider only the presence/absence of grouping expressions, not the aggregate
functions themselves, when deciding whether to propagate EmptyRelation.
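The corrected decision can be summarized in a few lines. This is a hypothetical Python model of the check, not the actual Scala rule in `PropagateEmptyRelation` (the function name and signature are illustrative):

```python
def can_propagate_empty_relation(grouping_exprs):
    """Hypothetical model of the corrected check: given that the
    aggregate's child is known to produce zero rows, is the aggregate's
    own output also empty?"""
    # Grouping expressions present: an empty input yields no groups, so
    # the output is empty and the Aggregate can become an EmptyRelation.
    # No grouping expressions: the aggregate always emits exactly one
    # row, so EmptyRelation must NOT be propagated. The aggregate
    # functions themselves never enter into the decision.
    return len(grouping_exprs) > 0

print(can_propagate_empty_relation(["x"]))  # True: safe to propagate
print(can_propagate_empty_relation([]))     # False: keep the aggregate
```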
## How was this patch tested?
- Added end-to-end regression tests in `SQLQueryTest`'s `group-by.sql` file.
- Updated unit tests in `PropagateEmptyRelationSuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark fix-PropagateEmptyRelation
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17929.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17929
----
commit 8b8ef4be23945f51eb6644149f8786a7b95d06b0
Author: Josh Rosen <[email protected]>
Date: 2017-05-10T00:30:06Z
Add failing SQLQueryTest regression test.
commit 37b0c1bda88a51487fa80accc66a6d38118ebe93
Author: Josh Rosen <[email protected]>
Date: 2017-05-10T00:38:49Z
Update optimizer tests to also fail.
commit 7a921bb46164762710cc58a20952486aa7556b59
Author: Josh Rosen <[email protected]>
Date: 2017-05-10T00:39:23Z
Implement fix.
----