hhhizzz opened a new pull request, #23028:
URL: https://github.com/apache/datafusion/pull/23028

   ## Which issue does this PR close?
   
   - Closes #23027.
   
   ## Rationale for this change
   
   After primary key constraints were added to the TPC-DS schemas, SQL aggregate 
planning could expand a GROUP BY on primary key columns to include every 
functionally dependent column from the input schema.
   
   For queries such as TPC-DS q39, many of those dependent columns are not 
needed after aggregation. Carrying them as aggregate group keys widens the 
grouping payload, forces extra scan/join projections, and can cause a large 
performance regression.
   
   Functional dependencies should still allow selecting columns determined by 
grouped keys, but aggregate planning only needs to add dependent columns that 
are actually referenced after aggregation.
   
   ## What changes are included in this PR?
   
   - Changes SQL aggregate planning to add functionally dependent group 
expressions only when they are required by post-aggregate expressions.
   - Tracks required columns from SELECT, HAVING, QUALIFY, ORDER BY, and 
DISTINCT ON expressions, ignoring columns referenced only inside aggregate 
functions.
   - Keeps referenced functionally dependent columns available for 
post-aggregate projection/filter/sort/distinct behavior.
   - Avoids adding unreferenced functionally dependent columns to aggregate 
group keys.
   - Adds focused sqllogictests for functional dependency (FD) group key 
pruning and required-column retention.
   - Updates existing group-by error message expectations for the narrower 
aggregate output.
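
The pruning rule in the list above can be illustrated with a minimal, self-contained sketch. This is not the code in this PR and does not use DataFusion's planner types: `Expr`, `FunctionalDependency`, `collect_required`, and `prune_group_keys` are hypothetical names invented for illustration, and the column names are made up. It only assumes the behavior described above: dependent columns are appended to the group keys when a post-aggregate expression references them outside an aggregate function.

```rust
use std::collections::BTreeSet;

/// Toy expression tree (hypothetical; not DataFusion's `Expr`).
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Literal(i64),
    Binary(Box<Expr>, Box<Expr>),
    /// An aggregate call such as SUM(arg).
    Aggregate(Box<Expr>),
}

/// Toy functional dependency: `determinant` columns determine `dependents`.
struct FunctionalDependency {
    determinant: Vec<String>,
    dependents: Vec<String>,
}

/// Collect columns referenced outside of aggregate function calls.
fn collect_required(expr: &Expr, out: &mut BTreeSet<String>) {
    match expr {
        Expr::Column(name) => {
            out.insert(name.clone());
        }
        Expr::Literal(_) => {}
        Expr::Binary(left, right) => {
            collect_required(left, out);
            collect_required(right, out);
        }
        // Columns inside an aggregate are consumed by the aggregate itself,
        // so they do not need to survive as group keys.
        Expr::Aggregate(_) => {}
    }
}

/// Return the explicit group keys plus only those functionally dependent
/// columns that post-aggregate expressions actually reference.
fn prune_group_keys(
    group_by: &[String],
    fds: &[FunctionalDependency],
    post_aggregate: &[Expr],
) -> Vec<String> {
    let mut required = BTreeSet::new();
    for expr in post_aggregate {
        collect_required(expr, &mut required);
    }

    let mut keys = group_by.to_vec();
    for fd in fds {
        // The dependency applies only if every determinant column is grouped.
        if fd.determinant.iter().all(|d| group_by.contains(d)) {
            for dep in &fd.dependents {
                if required.contains(dep) && !keys.contains(dep) {
                    keys.push(dep.clone());
                }
            }
        }
    }
    keys
}
```

In this sketch, grouping by a key `item_sk` that determines `item_desc`, `item_price`, and `item_brand`, with post-aggregate expressions `item_sk`, `item_desc`, and an aggregate over `item_price`, yields the group keys `item_sk` and `item_desc`; the other two dependent columns are left out.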
   
   ## Are these changes tested?
   
   Yes.
   
   New sqllogictest coverage was added in `group_by_fd_prune.slt` for:
   
   - unreferenced functionally dependent columns are not appended to aggregate 
group keys
   - SELECT references keep required FD columns available
   - HAVING references keep required FD columns available
   - hidden ORDER BY references (columns not in the SELECT list) keep required FD columns available
   - DISTINCT ON references keep required FD columns available
   - ordinal GROUP BY resolution still works
   
   Verification run locally:
   
   ```text
   cargo metadata --format-version 1 --locked
   cargo fmt --all -- --check
   cargo test --test sqllogictests -- group_by_fd_prune
   cargo test -p datafusion --test tpcds_planning tpcds_logical_q39
   ```
   
   ## Are there any user-facing changes?
   
   No SQL syntax or public API changes.
   
   Users may observe narrower optimized plans and improved performance for 
aggregate queries that group by keys, such as primary keys, that functionally 
determine other columns.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

