chirag-s-db commented on code in PR #53098:
URL: https://github.com/apache/spark/pull/53098#discussion_r2538735177


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala:
##########
@@ -140,6 +140,13 @@ case class EnsureRequirements(
       // Choose all the specs that can be used to shuffle other children
       val candidateSpecs = specs
           .filter(_._2.canCreatePartitioning)
+          .filter {

Review Comment:
   `checkKeyGroupCompatible` applies to the case where we have 2 
KeyGroupedPartitioned scans that are being joined against each other. For 
example, something like:
   ```
   SortMergeJoinExec ...
     +- BatchScanExec tbl1 ... -> reporting KeyGroupedPartitioning
     +- BatchScanExec tbl2 ... -> reporting KeyGroupedPartitioning
   ```
   
   If one child is not KeyGroupedPartitioned, we can still avoid the shuffle 
for one child (in general):
   ```
   SortMergeJoinExec ...
     +- BatchScanExec tbl1 ... -> reporting KeyGroupedPartitioning
     +- ShuffleExchangeExec KeyGroupedPartitioning
       +- BatchScanExec tbl2 ... -> reporting UnknownPartitioning
   ```
   
   However, if the child reporting the KeyGroupedPartitioning is not a 
BatchScanExec, then we can't safely push down the JOIN keys, making it unsafe 
to shuffle the other child by that partitioning. This may arise if we call 
`.checkpoint()` on a `BatchScanExec`:
   ```
   SortMergeJoinExec ...
     +- RDDScanExec ... -> reporting KeyGroupedPartitioning (coming from ckpt of tbl1 scan)
     +- ShuffleExchangeExec KeyGroupedPartitioning
       +- BatchScanExec tbl2 ... -> reporting UnknownPartitioning
   ```
   
   This extra check is for the second case (only one child is 
KeyGroupedPartitioned), where we want to make sure that we're not using a 
KeyGroupedPartitioning to shuffle another child of a JOIN without being able 
to push down JOIN keys. The test "SPARK-53322: checkpointed scans can't 
shuffle other children on SPJ" covers this case and fails without this change.
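
   To make the intent concrete, here is a rough, self-contained sketch of the 
filtering idea. It is a toy model, not the code in this PR: `Plan`, 
`BatchScan`, `RddScan`, `CandidateSpecs` and `canShuffleOthers` are 
hypothetical stand-ins for illustration, and it assumes the scan sits 
directly under the join, which real plans do not guarantee.

   ```scala
   // Toy model only: these types are hypothetical stand-ins, not Spark's classes.
   sealed trait Partitioning
   case object KeyGrouped extends Partitioning
   case object Unknown extends Partitioning

   sealed trait Plan { def partitioning: Partitioning }
   // Stand-in for a DSv2 scan, which can accept pushed-down JOIN keys.
   final case class BatchScan(table: String, partitioning: Partitioning) extends Plan
   // Stand-in for a checkpointed scan: it still reports the partitioning,
   // but it is no longer a scan that JOIN keys can be pushed into.
   final case class RddScan(source: String, partitioning: Partitioning) extends Plan

   object CandidateSpecs {
     // Assumption: a child may drive the shuffle of its siblings only if it is
     // key-grouped AND is a scan that supports JOIN key pushdown.
     def canShuffleOthers(child: Plan): Boolean = child match {
       case BatchScan(_, KeyGrouped) => true
       case _                        => false
     }

     // Indices of the children whose partitioning may be used to shuffle the others.
     def candidates(children: Seq[Plan]): Seq[Int] =
       children.zipWithIndex.collect { case (c, i) if canShuffleOthers(c) => i }
   }
   ```

   In this model, a `BatchScan("tbl1", KeyGrouped)` child is kept as a 
candidate, while an `RddScan` reporting `KeyGrouped` is filtered out, so the 
planner would fall back to a regular shuffle for the checkpointed case.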



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

