Re: [PR] [SPARK-54335][SQL] Reducing skew in the number of file splits per partition [spark]

via GitHub Sun, 23 Nov 2025 22:53:07 -0800


VindhyaG commented on code in PR #53040:
URL: https://github.com/apache/spark/pull/53040#discussion_r2554564269



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala:
##########
@@ -75,7 +75,7 @@ object FilePartition extends SessionStateHelper with Logging {
     }
 
     // Assign files to partitions using "Next Fit Decreasing"
-    partitionedFiles.foreach { file =>
+    partitionedFiles.sortBy(_.length)(implicitly[Ordering[Long]].reverse).foreach { file =>

Review Comment:
   The PR description mentions that the partitioned files are currently 
already sorted, so why do we need to sort the Seq[PartitionedFile] again? As 
far as I understand, 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L888
 already does that, right?
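
To make the question concrete, here is a minimal standalone sketch of a "Next Fit Decreasing" packing loop. It is not the actual Spark code: `FileSplit`, `NextFitDecreasingSketch`, and `pack` are hypothetical names used only for illustration, and the per-file open cost is modeled as a plain parameter. Under those assumptions, if the input already arrives in descending length order, the extra `sortBy` should not change which files land in which partition.

```scala
// Hypothetical stand-in for PartitionedFile; only the length matters here.
final case class FileSplit(path: String, length: Long)

object NextFitDecreasingSketch {
  // "Next fit": only the currently open partition is considered for each file.
  // "Decreasing": files are visited from largest to smallest.
  def pack(
      files: Seq[FileSplit],
      maxSplitBytes: Long,
      openCostInBytes: Long): List[List[FileSplit]] = {
    val partitions = scala.collection.mutable.ListBuffer.empty[List[FileSplit]]
    val current = scala.collection.mutable.ListBuffer.empty[FileSplit]
    var currentSize = 0L

    // Seal the open partition (if it holds anything) and start a new one.
    def closePartition(): Unit = {
      if (current.nonEmpty) partitions += current.toList
      current.clear()
      currentSize = 0L
    }

    // Explicit descending sort; effectively a no-op when the caller already sorted.
    files.sortBy(_.length)(Ordering[Long].reverse).foreach { file =>
      if (currentSize + file.length > maxSplitBytes) closePartition()
      currentSize += file.length + openCostInBytes
      current += file
    }
    closePartition()
    partitions.toList
  }
}
```

Calling `pack` once with an unsorted input and once with the same files pre-sorted in descending order gives the same partitions in this sketch, which is why the second sort looks redundant if the caller is guaranteed to sort first.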



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

