alexeykudinkin commented on code in PR #5337:
URL: https://github.com/apache/hudi/pull/5337#discussion_r851667284
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##########
@@ -84,21 +84,24 @@ class BaseFileOnlyRelation(sqlContext: SQLContext,
protected def collectFileSplits(partitionFilters: Seq[Expression],
dataFilters: Seq[Expression]): Seq[HoodieBaseFileSplit] = {
val partitions = listLatestBaseFiles(globPaths, partitionFilters, dataFilters)
- val fileSplits = partitions.values.toSeq.flatMap { files =>
- files.flatMap { file =>
- // TODO move to adapter
- // TODO fix, currently assuming parquet as underlying format
- HoodieDataSourceHelper.splitFiles(
- sparkSession = sparkSession,
- file = file,
- // TODO clarify why this is required
- partitionValues = getPartitionColumnsAsInternalRow(file)
- )
+ val fileSplits = partitions.values.toSeq
+ .flatMap { files =>
+ files.flatMap { file =>
+ // TODO fix, currently assuming parquet as underlying format
+ HoodieDataSourceHelper.splitFiles(
+ sparkSession = sparkSession,
+ file = file,
+ partitionValues = getPartitionColumnsAsInternalRow(file)
+ )
+ }
}
- }
+ // NOTE: It's important to order the splits in the reverse order of their
+ //       size so that we can subsequently bucket them in an efficient manner
+ .sortBy(_.length)(implicitly[Ordering[Long]].reverse)
Review Comment:
We want to maintain parity w/ Spark's behavior, which simply does greedy
packing (at most 2x less efficient than optimal).
@nsivabalan fair call, I agree that `getFilePartitions` seems like the more
appropriate place for this (and if it were placed there in Spark itself we
wouldn't have to fix it ourselves), but the reason I'm placing it here is to
keep our code (which is essentially a copy of Spark's) in sync (at least for now).
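To illustrate why the reverse-size sort matters, here is a minimal, hypothetical sketch (not Hudi's or Spark's actual code) of greedy packing in the style of Spark's `FilePartition.getFilePartitions`: splits are sorted by length in descending order, then a single greedy pass fills buckets up to a byte budget, closing a bucket whenever the next split would overflow it. The `FileSplit` case class, `packSplits` name, and the budget parameter are all assumptions for illustration.

```scala
object SplitPackingSketch {
  // Hypothetical stand-in for a file split: just a path and a byte length.
  final case class FileSplit(path: String, length: Long)

  // Greedy packing: sort splits largest-first, then sweep once, starting a
  // new bucket whenever adding the next split would exceed the budget.
  // This mirrors the greedy scheme the review comment calls "at most 2x
  // less efficient than optimal".
  def packSplits(splits: Seq[FileSplit], maxBytesPerBucket: Long): Seq[Seq[FileSplit]] = {
    val buckets = scala.collection.mutable.ArrayBuffer.empty[Seq[FileSplit]]
    val current = scala.collection.mutable.ArrayBuffer.empty[FileSplit]
    var currentSize = 0L

    def closeBucket(): Unit = {
      if (current.nonEmpty) {
        buckets += current.toSeq
        current.clear()
        currentSize = 0L
      }
    }

    // Same reverse-size ordering as in the reviewed diff.
    splits.sortBy(_.length)(implicitly[Ordering[Long]].reverse).foreach { split =>
      if (currentSize + split.length > maxBytesPerBucket) closeBucket()
      current += split
      currentSize += split.length
    }
    closeBucket()
    buckets.toSeq
  }
}
```

With a 100-byte budget and splits of 90, 60, 40, and 10 bytes, the greedy pass yields buckets of [90], [60, 40], [10]: three buckets against an optimal two, i.e. within the 2x bound, and each bucket stays under the budget because large splits seed buckets before small ones are slotted in.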
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]