aokolnychyi commented on PR #34785: URL: https://github.com/apache/spark/pull/34785#issuecomment-1132014116
Thanks for the PR, @huaxingao. I think it is a great feature and it would be awesome to get it done. I spent some time thinking about this and have a few questions/proposals. If I understand correctly, we currently hard-code the number of shuffle partitions in `RepartitionByExpression`, which prohibits both coalescing and skew split optimizations. It seems reasonable to support cases when the requested distribution is best-effort but I also think there are valid cases when the distribution is required for correctness and it is actually the current API contract. What about extending `RequiredDistributionAndOrdering` to indicate the distribution is not strictly required? We can add some boolean method and default it to keep the existing behavior. If the distribution is required, we can still benefit from coalescing as I think `CoalesceShufflePartitions` and `AQEShuffleReadExec` would keep the original distribution in coalesce cases. That’s already a huge win. We can avoid too small files while keeping the requested distribution. I also agree about using `RebalancePartitions` when the distribution is not strictly required. What about extending `RebalancePartitions` to also support range partitioning? It currently supports only hash and round-robin. If we make that change, we will be able to remove unnecessary shuffles in the optimizer and keep the original distribution as long as there is no skew and we only coalesce. If there is a skew, an extra shuffle and changed distribution seems like a reasonable overhead. What does everybody else think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
