liorregev opened a new pull request, #45093: URL: https://github.com/apache/spark/pull/45093
`AliasAwareOutputExpression` does not detect that `select(F.struct($"my_field"))` retains partitioning when the dataset was already partitioned by `$"my_field"` before the select. This causes an additional shuffle to be added when using `joinWith` on datasets that were already partitioned accordingly.

### What changes were proposed in this pull request?
`AliasAwareOutputExpression` should respect struct fields when returning `outputPartitioning`.

### Why are the changes needed?
Extra shuffles are bad and slow down my pipeline. Would like them gone, please.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a unit test that covers the scenario.

### Was this patch authored or co-authored using generative AI tooling?
No
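For readers who want to see the scenario, here is a minimal sketch of how it might be reproduced. It is not taken from the PR or its unit test: the `Rec` case class, the sample data, the object name, and the local session settings are all hypothetical, and whether an extra `Exchange` appears in the printed plans depends on the Spark version in use.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

// Hypothetical record type, used only for this illustration.
case class Rec(my_field: Int)

object StructPartitioningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("struct-partitioning-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumption: disable broadcast joins so the join has to rely on
    // the partitioning of both sides instead of broadcasting one side.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // Both sides are hash-partitioned by my_field up front.
    val left = Seq(Rec(1), Rec(2), Rec(3)).toDS().repartition($"my_field")
    val right = Seq(Rec(2), Rec(3), Rec(4)).toDS().repartition($"my_field")

    // Wrapping the partitioning column in a struct, as in the description.
    // The question is whether the plan still knows it is partitioned by my_field.
    left.select(struct($"my_field").as("s")).explain()

    // joinWith on the pre-partitioned datasets; inspect the physical plan
    // for Exchange nodes beyond the two explicit repartitions.
    val joined = left.as("l").joinWith(right.as("r"), $"l.my_field" === $"r.my_field")
    joined.explain()

    spark.stop()
  }
}
```

Comparing the `joinWith` plan before and after the patch should show whether the shuffle described above is gone; no expected output is given here because the exact plan text varies between versions.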
