hhhizzz opened a new pull request, #23028: URL: https://github.com/apache/datafusion/pull/23028
## Which issue does this PR close?

- Closes #23027.

## Rationale for this change

After primary key constraints were added to TPC-DS schemas, SQL aggregate planning could expand grouped primary key columns with all functionally dependent columns from the input schema. For queries such as TPC-DS q39, many of those dependent columns are not needed after aggregation. Carrying them as aggregate group keys widens the grouping payload, forces extra scan/join projections, and can cause a large performance regression.

Functional dependencies should still allow selecting columns determined by grouped keys, but aggregate planning only needs to add dependent columns that are actually referenced after aggregation.

## What changes are included in this PR?

- Changes SQL aggregate planning to add functionally dependent group expressions only when they are required by post-aggregate expressions.
- Tracks required columns from SELECT, HAVING, QUALIFY, ORDER BY, and DISTINCT ON expressions, ignoring columns referenced only inside aggregate functions.
- Keeps referenced functionally dependent columns available for post-aggregate projection/filter/sort/distinct behavior.
- Avoids adding unreferenced functionally dependent columns to aggregate group keys.
- Adds focused sqllogictests for FD group key pruning and required-column retention.
- Updates existing group-by error message expectations for the narrower aggregate output.

## Are these changes tested?

Yes.
New sqllogictest coverage was added in `group_by_fd_prune.slt` for:

- unreferenced functionally dependent columns are not appended to aggregate group keys
- SELECT references keep required FD columns available
- HAVING references keep required FD columns available
- ORDER BY hidden references keep required FD columns available
- DISTINCT ON references keep required FD columns available
- ordinal GROUP BY resolution still works

Verification run locally:

```text
cargo metadata --format-version 1 --locked
cargo fmt --all -- --check
cargo test --test sqllogictests -- group_by_fd_prune
cargo test -p datafusion --test tpcds_planning tpcds_logical_q39
```

## Are there any user-facing changes?

No SQL syntax or public API changes. Users may observe narrower optimized plans and improved performance for affected aggregate queries that group by determinant keys with functional dependencies.
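For illustration only, the pruning rule described under "What changes are included in this PR?" can be sketched in a few lines of standalone Rust. This is a simplified model, not the code in this PR: the `FunctionalDependency` struct, the `prune_fd_group_keys` function, and the `item` table columns are all hypothetical names chosen for the example, and columns are plain strings rather than DataFusion expressions.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified functional dependency: the `determinant`
/// columns uniquely determine every column listed in `dependents`.
struct FunctionalDependency {
    determinant: Vec<&'static str>,
    dependents: Vec<&'static str>,
}

/// Returns the aggregate group keys: the explicit GROUP BY columns plus
/// only those functionally dependent columns that are referenced after
/// aggregation (SELECT, HAVING, QUALIFY, ORDER BY, DISTINCT ON).
fn prune_fd_group_keys(
    group_by: &[&'static str],
    fds: &[FunctionalDependency],
    required_after_agg: &BTreeSet<&'static str>,
) -> Vec<&'static str> {
    let mut keys: Vec<&'static str> = group_by.to_vec();
    for fd in fds {
        // A dependency applies only if every determinant column is grouped.
        if !fd.determinant.iter().all(|c| group_by.contains(c)) {
            continue;
        }
        for dep in &fd.dependents {
            // Add a dependent column only when it is actually needed later
            // and is not already a group key.
            if required_after_agg.contains(dep) && !keys.contains(dep) {
                keys.push(*dep);
            }
        }
    }
    keys
}

fn main() {
    // Assumed table: item(i_item_sk PRIMARY KEY, i_brand, i_category, i_price).
    let fds = vec![FunctionalDependency {
        determinant: vec!["i_item_sk"],
        dependents: vec!["i_brand", "i_category", "i_price"],
    }];

    // Models: SELECT i_item_sk, i_brand, SUM(i_price) ... GROUP BY i_item_sk
    // `i_price` appears only inside SUM, so it is not required after
    // aggregation; `i_category` is never referenced at all.
    let required = BTreeSet::from(["i_item_sk", "i_brand"]);

    let keys = prune_fd_group_keys(&["i_item_sk"], &fds, &required);
    println!("{:?}", keys); // prints ["i_item_sk", "i_brand"]
}
```

In this model, expanding every dependent column would have produced four group keys. Restricting the expansion to referenced columns yields two, which corresponds to the narrower aggregate output described above.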
