[PR] [SPARK-53074][SQL][3.5] Avoid partial clustering in SPJ to meet a child's required distribution [spark]

via GitHub Mon, 12 Jan 2026 10:52:26 -0800


peter-toth opened a new pull request, #53770:
URL: https://github.com/apache/spark/pull/53770


   What changes were proposed in this pull request?
   Currently, when partial clustering is enabled, SPJ (storage-partitioned 
join) logic can apply it to either side of an inner JOIN as long as the nodes 
between the scan and the JOIN preserve partitioning. This is incorrect when one 
of those nodes relies on the scan's key-grouped partitioning to satisfy its own 
required distribution (for example, a grouping aggregate or a window function).
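
   To illustrate the failure mode, here is a minimal toy model in plain Python. 
It is not Spark code: the partition layouts and the `sum_per_partition` helper 
are hypothetical, and the sketch assumes only that partial clustering may split 
the rows of one key across several partitions while the aggregate runs per 
partition without a shuffle.

```python
from collections import defaultdict

# Key-grouped layout: every row of a key lives in exactly one partition.
key_grouped = [[("a", 1), ("a", 2), ("a", 3)], [("b", 10)]]

# Partially clustered layout (assumed): the large key "a" is split in two.
partially_clustered = [[("a", 1), ("a", 2)], [("a", 3)], [("b", 10)]]


def sum_per_partition(partitions):
    """Aggregate each partition independently, with no shuffle in between."""
    out = []
    for part in partitions:
        sums = defaultdict(int)
        for key, value in part:
            sums[key] += value
        out.extend(sorted(sums.items()))
    return out


print(sum_per_partition(key_grouped))          # [('a', 6), ('b', 10)]
print(sum_per_partition(partially_clustered))  # [('a', 3), ('a', 3), ('b', 10)]
```

   With the split layout, key "a" yields two partial groups instead of one, 
which is the kind of wrong result a grouping aggregate would produce.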
   
   This PR fixes the issue by not applying a partially clustered distribution 
to a JOIN's child if any node in that child relies on the 
KeyGroupedPartitioning to satisfy its required distribution, since that is not 
safe with a partially clustered distribution.
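
   The rule described above can be sketched as a recursive walk over a join 
child's plan. This is a simplified Python model, not the patch itself: `Node`, 
`relies_on_key_grouping`, and `can_partially_cluster` are hypothetical names 
introduced for illustration and do not correspond to Spark classes or methods.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    # Assumed flag: True if this operator needs all rows of a key in one
    # partition and gets that guarantee from key-grouped partitioning.
    relies_on_key_grouping: bool = False
    children: List["Node"] = field(default_factory=list)


def can_partially_cluster(node: Node) -> bool:
    """Return False if any operator in this join child depends on key grouping."""
    if node.relies_on_key_grouping:
        return False
    return all(can_partially_cluster(child) for child in node.children)


scan_only = Node("Project", children=[Node("BatchScan")])
with_agg = Node(
    "Project",
    children=[
        Node("HashAggregate", relies_on_key_grouping=True,
             children=[Node("BatchScan")]),
    ],
)

print(can_partially_cluster(scan_only))  # True
print(can_partially_cluster(with_agg))   # False
```

   In this model, a join side stays eligible for partial clustering only when 
no operator beneath the JOIN depends on the key-grouped layout.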
   
   Why are the changes needed?
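   Without this fix, applying a partially clustered distribution with SPJ can 
split the rows of one partition key across several partitions, so a node that 
relies on key-grouped partitioning may return incorrect results.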
   
   Does this PR introduce any user-facing change?
   No.
   
   How was this patch tested?
   See test changes.
   
   Was this patch authored or co-authored using generative AI tooling?
   No.

