Re: [PR] Optimize away unused `UNNEST` under duplicate-insensitive aggregates [datafusion]

via GitHub Fri, 29 May 2026 20:16:30 -0700


kosiew commented on PR #22161:
URL: https://github.com/apache/datafusion/pull/22161#issuecomment-4581465236


   @alamb 
   I made changes to address the above comments.
   
   > don't fully undesrtand the transformation and don't quite follow why it is 
always correct (as in how sure are we that this won't get wrong answers)
   
   This rewrite is intentionally much narrower than general “unused operator” 
pruning.
   
   The transformation only fires for an aggregate with no aggregate 
expressions, i.e. a duplicate-insensitive grouping-only aggregate. It also 
requires:
   
   - no required/group expression references any unnested output column
   - no required/group expression is volatile
   - the `UNNEST` is proven row-preserving: currently only depth-1 literal 
lists with a valid, non-empty first value
   - empty/null lists, recursive unnests, non-literal unnests, referenced 
unnested columns, and visible aggregate expressions are all rejected
   
   So the matched case is effectively:
   
   ```sql
   SELECT id
   FROM (
     SELECT id, UNNEST([1, 2]) AS elem
     FROM t
   )
   GROUP BY id
   ```
   
   which can become:
   
   ```sql
   SELECT id
   FROM t
   GROUP BY id
   ```
   
   because `elem` is unused, every input row still contributes at least one row 
after unnest, and grouping by `id` collapses any duplicate rows introduced by 
unnest.
   
   We added regression tests for the unsafe cases too: empty/null lists, 
recursive unnest, non-literal list inputs, referenced unnested output, visible 
aggregates like `COUNT(*)`, and volatile grouping expressions such as `uuid()`. 
Those cases keep the `UNNEST`.
   
   I’ll add/expand the code comment to make this invariant clearer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Optimize away unused `UNNEST` under duplicate-insensitive aggregates [datafusion]

Reply via email to