peter-toth commented on PR #53859:
URL: https://github.com/apache/spark/pull/53859#issuecomment-3822791189

   Please note that cached RDDs, not just checkpointed ones, would also need an 
extra shuffle.
   
   Actually, I wonder whether `BatchScanExec` is the right place for grouping 
partitions by key, or whether the grouping could be done by a separate operator. 
That operator would sit between a consumer that requires `ClusteredDistribution` 
and a producer that provides "partitions with keys" partitioning. 
`EnsureRequirements` could insert such operators where needed, similarly to how 
it inserts exchanges now. As it would be a new operator, it could be inserted on 
top of `BatchScanExec`, `LogicalRDD` (checkpointed plan), and 
`InMemoryTableScanExec` (cached plan) alike.
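
   To make the idea concrete, here is a rough toy model in Python rather than
Spark's actual Scala internals. Everything in it is an assumption for
illustration: `Plan`, `group_partitions`, `shuffle_exchange`, and the
`"keyed"` / `"clustered"` / `"unknown"` labels are hypothetical stand-ins, and
the real `EnsureRequirements` rule is considerably more involved. It only shows
the shape of the proposal: a planner pass that puts a grouping operator on top
of any producer that reports keyed partitions, and falls back to a shuffle
otherwise.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Plan:
    """Toy physical plan node (hypothetical, not a Spark class)."""
    name: str                                   # e.g. "BatchScanExec"
    partitioning: str                           # "keyed", "clustered" or "unknown"
    children: Tuple["Plan", ...] = ()
    required_child_distribution: Optional[str] = None


def group_partitions(child: Plan) -> Plan:
    # Hypothetical new operator: groups partitions that share a key,
    # so its output counts as clustered without moving data across nodes.
    return Plan("GroupPartitions", "clustered", (child,))


def shuffle_exchange(child: Plan) -> Plan:
    # Stand-in for the exchange that is inserted when nothing cheaper applies.
    return Plan("ShuffleExchange", "clustered", (child,))


def ensure_requirements(plan: Plan) -> Plan:
    """Bottom-up pass that satisfies a 'clustered' requirement on children."""
    new_children = []
    for child in plan.children:
        child = ensure_requirements(child)
        needs_fix = (
            plan.required_child_distribution == "clustered"
            and child.partitioning != "clustered"
        )
        if needs_fix:
            if child.partitioning == "keyed":
                child = group_partitions(child)   # no shuffle needed
            else:
                child = shuffle_exchange(child)   # fall back to a shuffle
        new_children.append(child)
    return Plan(
        plan.name,
        plan.partitioning,
        tuple(new_children),
        plan.required_child_distribution,
    )


if __name__ == "__main__":
    # The same pass handles all three producers mentioned above.
    for leaf_name in ("BatchScanExec", "LogicalRDD", "InMemoryTableScanExec"):
        leaf = Plan(leaf_name, "keyed")
        consumer = Plan("Aggregate", "clustered", (leaf,), "clustered")
        fixed = ensure_requirements(consumer)
        print(fixed.name, "<-", fixed.children[0].name, "<-", leaf_name)
```

   The point of the sketch is that the grouping logic lives in one operator and
the insertion logic lives in one rule, so neither the scan nor the checkpointed
or cached leaf has to know about grouping.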
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

