[PR] [SPARK-47037] Fix AliasAwareOutputExpression outputPartitioning [spark]

via GitHub Tue, 13 Feb 2024 14:15:57 -0800


liorregev opened a new pull request, #45093:
URL: https://github.com/apache/spark/pull/45093


   `AliasAwareOutputExpression` does not detect that 
`select(F.struct($"my_field"))` retains the partitioning when the dataset was 
partitioned by `$"my_field"` before the select.
   As a result, an additional shuffle is added when using `joinWith` on 
datasets that are already partitioned accordingly.
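
   A minimal sketch of how the scenario above might be reproduced locally. The case class `Rec`, the object name, and the sample data are hypothetical and not taken from the patch or its unit test. Adaptive execution and broadcast joins are turned off here only so that the physical plan is easier to read. Whether an `Exchange` node appears above each side of the join depends on the Spark version in use.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type with the field name used in the description.
case class Rec(my_field: Int)

object JoinWithShuffleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("SPARK-47037-sketch")
      // Assumption: disable AQE and broadcast joins so the plan shows a
      // plain shuffle-based join instead of an adaptive or broadcast plan.
      .config("spark.sql.adaptive.enabled", "false")
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()
    import spark.implicits._

    // Both sides are explicitly repartitioned by the join key up front.
    val left = Seq(Rec(1), Rec(2), Rec(3)).toDS().repartition($"my_field")
    val right = Seq(Rec(2), Rec(3), Rec(4)).toDS().repartition($"my_field")

    // joinWith returns pairs of whole records from each side.
    val joined = left.joinWith(right, left("my_field") === right("my_field"))

    // Inspect the physical plan and look for Exchange nodes above the
    // already repartitioned inputs.
    joined.explain()

    spark.stop()
  }
}
```

   Comparing the printed plan before and after applying the patch should show whether the partitioning established by `repartition` is reused by the join.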
   
   
   ### What changes were proposed in this pull request?
   `AliasAwareOutputExpression` should respect struct fields when returning 
`outputPartitioning`.
   
   
   ### Why are the changes needed?
   The extra shuffle is unnecessary and slows down pipelines that call `joinWith` on already-partitioned datasets. This change is meant to eliminate it.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Added a unit test that covers the scenario.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


