TheR1sing3un commented on code in PR #13070:
URL: https://github.com/apache/hudi/pull/13070#discussion_r2030659071


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -169,26 +169,29 @@ case class HoodieFileIndex(spark: SparkSession,
     val prunedPartitionsAndFilteredFileSlices = filterFileSlices(dataFilters, partitionFilters).map {
       case (partitionOpt, fileSlices) =>
         if (shouldEmbedFileSlices) {
-          val baseFileStatusesAndLogFileOnly: Seq[FileStatus] = fileSlices.map(slice => {
-            if (slice.getBaseFile.isPresent) {
+          val logFileEstimationFraction = options.getOrElse(HoodieStorageConfig.LOGFILE_TO_PARQUET_COMPRESSION_RATIO_FRACTION.key(),
+            HoodieStorageConfig.LOGFILE_TO_PARQUET_COMPRESSION_RATIO_FRACTION.defaultValue()).toDouble
+          // 1. Generate a disguised representative file for each file slice, which Spark uses to optimize
+          // RDD partition parallelism based on data such as file size.
+          // For a file slice with only a base file, we use the base file size directly as the representative file size.
+          // For a file slice with log files, we estimate the representative file size from the log file sizes
+          // and the (optional) base file size.
+          val representFiles = fileSlices.map(slice => {
+            val estimationFileSize = FileSliceUtils.getTotalFileSizeAsParquetFormat(slice, logFileEstimationFraction)
+            val fileInfo = if (slice.getBaseFile.isPresent) {
               slice.getBaseFile.get().getPathInfo
-            } else if (slice.hasLogFiles) {
-              slice.getLogFiles.findAny().get().getPathInfo
             } else {
-              null
+              slice.getLogFiles.findAny().get().getPathInfo
             }
-          }).filter(slice => slice != null)
-            .map(fileInfo => new FileStatus(fileInfo.getLength, fileInfo.isDirectory, 0, fileInfo.getBlockSize,
-              fileInfo.getModificationTime, new Path(fileInfo.getPath.toUri)))
+            new FileStatus(estimationFileSize, fileInfo.isDirectory, 0, fileInfo.getBlockSize,
+              fileInfo.getModificationTime, new Path(fileInfo.getPath.toUri))

Review Comment:
   > Does this affect reading the file because the file status is manipulated? Is there a different way of letting Spark know the file / partitioned file size estimation?
   
   The actual read process is still controlled by Hudi code. Spark only uses this file status for parallelism-related optimizations (e.g. deciding partition splits from the reported file size); manipulating it does not change what Hudi actually reads.
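   For illustration, the estimation the new comments describe can be sketched roughly as below. This is a hypothetical standalone helper, not the actual `FileSliceUtils.getTotalFileSizeAsParquetFormat` implementation; the method name, rounding, and exact combination of sizes are assumptions:
   
   ```scala
   // Hypothetical sketch: estimate a file slice's size "as parquet" by taking the
   // base file size at face value and scaling log file bytes by the configured
   // log-to-parquet compression ratio fraction. The real Hudi logic may differ.
   object RepresentativeSizeSketch {
     def estimateSize(baseFileSize: Long,
                      logFileSizes: Seq[Long],
                      logToParquetFraction: Double): Long = {
       // Log blocks are typically avro-encoded and larger than their parquet
       // equivalent, so they are discounted by the fraction before summing.
       baseFileSize + (logFileSizes.sum * logToParquetFraction).toLong
     }
   }
   ```
   
   Under this sketch, a slice with only a 100-byte base file keeps its real size, while a slice with 100 bytes of logs and fraction 0.35 is represented as base + 35 bytes, which is what Spark then sees when sizing RDD partitions.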



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to