rekbun commented on code in PR #45267:
URL: https://github.com/apache/spark/pull/45267#discussion_r1966642543
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala:
##########
@@ -164,6 +165,18 @@ case class BatchScanExec(
(groupedParts, expressions)
}
+ // Also re-group the partitions if we are reducing compatible partition expressions
+ val finalGroupedPartitions = spjParams.reducers match {
Review Comment:
I believe this could produce incorrect results when joining presorted
bucketed tables with compatible bucket counts.
Specifically, suppose we have two tables that are:
1. Bucketed and sorted on the same join keys
2. Created with different bucket counts, where one table's bucket count is a
multiple of the other's

When performing a bucketed join in Spark, the sort order within each partition
is expected to be preserved, which is what lets the planner elide the sort
before a sort-merge join. However, it appears that the re-grouping here
concatenates several sorted splits into a single partition without re-sorting
them, breaking that guarantee and potentially producing incorrect join results
(see the sketch below).
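
To make the concern concrete, here is a minimal self-contained sketch (plain
Scala, not the actual `BatchScanExec` code) of why concatenating sorted splits
during the reducer-based re-grouping would break the per-partition ordering a
sort-merge join relies on. The bucket counts 4 and 8 and the modulo bucketing
are illustrative assumptions, not taken from the PR:

```scala
object SortOrderSketch extends App {
  // Table A has 4 buckets; table B has 8 (a multiple of 4). Both are
  // presorted on the join key within each bucket.
  val bBucket0 = Seq(0L, 8L, 16L)  // B's bucket 0: keys with id % 8 == 0
  val bBucket4 = Seq(4L, 12L, 20L) // B's bucket 4: keys with id % 8 == 4

  // Reducing B's 8 buckets to 4 maps both of these buckets to A's
  // bucket 0 (id % 4 == 0). If the two sorted splits are simply
  // concatenated, the merged partition is no longer sorted on the key.
  val merged = bBucket0 ++ bBucket4
  println(merged)                  // List(0, 8, 16, 4, 12, 20)
  assert(merged != merged.sorted)  // the ordering guarantee is broken
}
```

A sort-merge join that skips its own sort because the scan still reports a
sorted output ordering would then mis-pair or drop rows; reproducing this
end-to-end would need a V2 source that reports both the bucket transform and
the sort order.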