peter-toth commented on PR #53859: URL: https://github.com/apache/spark/pull/53859#issuecomment-3822791189
Please note that not only checkpointed RDDs but also cached RDDs would need an extra shuffle. Actually, I wonder whether partition grouping by key is in the right place in `BatchScanExec`, or whether it could be a separate operator that does the grouping. That operator would sit between a consumer that requires `ClusteredDistribution` and a producer that provides a "partitions with keys" partitioning. `EnsureRequirements` could insert such operators where needed, similarly to how it inserts exchanges now. As it would be a new operator, it could be inserted on top of `BatchScanExec` as well as `LogicalRDD` (checkpointed plan) and `InMemoryTableScanExec` (cached plan).
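
To make the proposal concrete, here is a minimal toy model of the idea in plain Python. It is not Spark code and does not reflect how `EnsureRequirements` is actually implemented: the `Plan` class, the partitioning labels (`"keyed"`, `"grouped"`, `"hash"`, `"unknown"`), and the operator name `GroupPartitionsExec` are all hypothetical, chosen only for illustration. It sketches a rule that places a grouping operator above any producer reporting a "partitions with keys" partitioning, and falls back to a shuffle otherwise.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Plan:
    """Toy physical plan node (hypothetical, not a Spark class)."""
    name: str
    partitioning: str                       # "keyed", "grouped", "hash" or "unknown"
    child: Optional["Plan"] = None
    required_child_distribution: Optional[str] = None  # "clustered" or None


def satisfies(partitioning: str, distribution: Optional[str]) -> bool:
    # Assumption for this sketch: only grouped or hash-partitioned output
    # satisfies a clustered requirement; raw "keyed" partitions do not.
    if distribution is None:
        return True
    return partitioning in ("grouped", "hash")


def ensure_requirements(plan: Plan) -> Plan:
    """Insert a grouping operator or a shuffle below `plan` where needed."""
    if plan.child is None:
        return plan
    child = ensure_requirements(plan.child)
    if not satisfies(child.partitioning, plan.required_child_distribution):
        if child.partitioning == "keyed":
            # Producer already exposes partitions with keys: group them, no shuffle.
            child = Plan("GroupPartitionsExec", "grouped", child)
        else:
            # No usable partitioning information: fall back to an exchange.
            child = Plan("ShuffleExchangeExec", "hash", child)
    return replace(plan, child=child)


def describe(plan: Optional[Plan]) -> str:
    """Render the operator chain from consumer down to producer."""
    names = []
    while plan is not None:
        names.append(plan.name)
        plan = plan.child
    return " <- ".join(names)


if __name__ == "__main__":
    for scan in ("BatchScanExec", "LogicalRDD", "InMemoryTableScanExec"):
        consumer = Plan("SortMergeJoinExec", "unknown", Plan(scan, "keyed"), "clustered")
        print(describe(ensure_requirements(consumer)))
```

Because the grouping step in this sketch depends only on the partitioning the child reports, and not on the child's type, the same rule covers the scan, the checkpointed plan, and the cached plan without any of them needing grouping logic of their own.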
