kosiew commented on PR #22161:
URL: https://github.com/apache/datafusion/pull/22161#issuecomment-4581465236
@alamb
I made changes to address the above comments.
> don't fully undesrtand the transformation and don't quite follow why it is
always correct (as in how sure are we that this won't get wrong answers)
This rewrite is intentionally much narrower than general “unused operator”
pruning.
The transformation only fires for an aggregate with no aggregate
expressions, i.e. a duplicate-insensitive grouping-only aggregate. It also
requires:
- no required/group expression references any unnested output column
- no required/group expression is volatile
- the `UNNEST` is proven row-preserving: currently only depth-1 literal
lists with a valid, non-empty first value
- empty/null lists, recursive unnests, non-literal unnests, referenced
unnested columns, and visible aggregate expressions are all rejected
So the matched case is effectively:
```sql
SELECT id
FROM (
SELECT id, UNNEST([1, 2]) AS elem
FROM t
)
GROUP BY id
```
which can become:
```sql
SELECT id
FROM t
GROUP BY id
```
because `elem` is unused, every input row still contributes at least one row
after unnest, and grouping by `id` collapses any duplicate rows introduced by
unnest.
We added regression tests for the unsafe cases too: empty/null lists,
recursive unnest, non-literal list inputs, referenced unnested output, visible
aggregates like `COUNT(*)`, and volatile grouping expressions such as `uuid()`.
Those cases keep the `UNNEST`.
I’ll add/expand the code comment to make this invariant clearer.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]