peter-toth opened a new pull request, #53770: URL: https://github.com/apache/spark/pull/53770
What changes were proposed in this pull request?

Currently, the SPJ (storage-partitioned join) logic can apply partial clustering (when enabled) to either side of an inner JOIN, as long as the nodes between the scan and the JOIN preserve partitioning. This is not safe when one of those nodes relies on the scan's key-grouped partitioning to satisfy its own required distribution (for example, a grouping aggregate or a window function). This PR fixes the issue by not applying a partially clustered distribution to a JOIN's child if any node in that child relies on KeyGroupedPartitioning to satisfy its required distribution.

Why are the changes needed?

Without this fix, using a partially clustered distribution with SPJ may cause correctness issues.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

See test changes.

Was this patch authored or co-authored using generative AI tooling?

No.
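To illustrate the idea, here is a minimal sketch in Python. It is a toy model and not Spark's implementation: the `Node` class, the `relies_on_key_grouping` flag, and both helper functions are hypothetical names invented for this example. It assumes that partial clustering can spread the rows of one key over several partitions, which is what makes a node that expects each key in exactly one partition produce wrong results.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical, simplified stand-in for a physical plan node."""
    name: str
    children: List["Node"] = field(default_factory=list)
    # True when the node needs all rows of a key in one partition and gets
    # that guarantee from the scan's key-grouped partitioning (assumption
    # made for this sketch; real plans derive this from distributions).
    relies_on_key_grouping: bool = False


def safe_for_partial_clustering(plan: Node) -> bool:
    """Return False if any node in the subtree relies on key grouping."""
    if plan.relies_on_key_grouping:
        return False
    return all(safe_for_partial_clustering(c) for c in plan.children)


def count_per_partition(partitions: List[List[str]]) -> List[Tuple[str, int]]:
    """Count rows per key independently inside each partition, with no shuffle."""
    return [(k, p.count(k)) for p in partitions for k in sorted(set(p))]


# A join child that only projects columns, and one that aggregates on the key.
project_side = Node("Project", [Node("BatchScan")])
agg_side = Node(
    "Project",
    [Node("HashAggregate", [Node("BatchScan")], relies_on_key_grouping=True)],
)

# Each key in one partition versus key "a" split across two partitions.
fully_clustered = [["a", "a", "a"], ["b"]]
partially_clustered = [["a", "a"], ["a"], ["b"]]

print(safe_for_partial_clustering(project_side))
print(safe_for_partial_clustering(agg_side))
print(count_per_partition(fully_clustered))
print(count_per_partition(partially_clustered))
```

In this model the projection-only child is eligible for partial clustering, while the child containing the aggregate is not. The last two calls show why: once the rows of key "a" are split, a per-partition count emits two separate rows for "a" instead of one.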
