Davis Zhang created HUDI-9787:
---------------------------------
Summary: HoodieFileIndex file listing perf issue
Key: HUDI-9787
URL: https://issues.apache.org/jira/browse/HUDI-9787
Project: Apache Hudi
Issue Type: Bug
Components: index
Reporter: Davis Zhang
Fix For: 1.2.0
Attachments: image-2025-09-04-10-23-53-374.png
org.apache.hudi.HoodieFileIndex#filterFileSlices
{code:java}
val prunedPartitionsAndFilteredFileSlices = prunedPartitionsAndFileSlices.map {
  case (partitionOpt, fileSlices) =>
    // Filter in candidate files based on the col-stats or record level index lookup
    val candidateFileSlices: Seq[FileSlice] = {
      fileSlices.filter(fs => {
        val fileSliceFiles = fs.getLogFiles
          .map[String](JFunction.toJavaFunction[HoodieLogFile, String](lf => lf.getPath.getName))
          .collect(Collectors.toSet[String])
        val baseFileStatusOpt = getBaseFileStatus(Option.apply(fs.getBaseFile.orElse(null)))
        baseFileStatusOpt.exists(f => fileSliceFiles.add(f.getPath.getName))
        // NOTE: This predicate is true when {@code Option} is empty
        candidateFilesNamesOpt.forall(files => files.exists(elem => fileSliceFiles.contains(elem)))
      })
    }
    totalFileSliceSize += fileSlices.size
    candidateFileSliceSize += candidateFileSlices.size
    (partitionOpt, candidateFileSlices)
}
val skippingRatio =
  if (!areAllFileSlicesCached) -1
  else if (getAllFiles().nonEmpty && totalFileSliceSize > 0)
    (totalFileSliceSize - candidateFileSliceSize) / totalFileSliceSize.toDouble
  else 0
logInfo(s"Total file slices: $totalFileSliceSize; " +
  s"candidate file slices after data skipping: $candidateFileSliceSize; " +
  s"skipping percentage $skippingRatio")
{code}
This does nested-loop style processing between two Scala collections, candidateFilesNamesOpt and prunedPartitionsAndFileSlices, to figure out their overlap.
In a production use case (on a 0.14.1-based branch), we saw that
!image-2025-09-04-10-23-53-374.png!
This means both lists are of size 55.8k, so the nested loop performs roughly 55.8k * 55.8k, about 3.1 billion, iterations, which is not trivial.
We observed this on a cluster with 32 GB of driver memory and no CPU throttling
(this logic is single-threaded anyway). In this setup HoodieFileIndex chose
COLUMN_STATS_INDEX_PROCESSING_MODE_IN_MEMORY, which means everything runs on the
driver.
The nested loop took ~10 min to complete. During this time no Spark tasks are
running, because this is driver-only logic and there are no Spark tasks
available for execution. The performance impact can be even more profound if the
auto-scaling logic of the user's Spark application chooses to toss away
executors and later scale them back up; this churn makes the end-to-end Spark
application performance worse.
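One possible direction, shown here as a minimal standalone sketch with illustrative names (FileSliceFilterSketch, filterQuadratic, filterHashed are not the actual patch): hoist the candidate file-name collection into a hash set once, outside the per-slice filter, and probe each slice's small file set against it. That turns the O(n * m) nested scan into O(n + m) work overall.
{code:scala}
object FileSliceFilterSketch {
  // Quadratic shape of the current code: for every file slice, scan the
  // entire candidate-name collection and test membership per element.
  def filterQuadratic(slices: Seq[Set[String]], candidates: Seq[String]): Seq[Set[String]] =
    slices.filter(sliceFiles => candidates.exists(sliceFiles.contains))

  // Hypothetical fix: build a hash set of candidate names once, then check
  // each slice's (small) file set against it with O(1) probes.
  def filterHashed(slices: Seq[Set[String]], candidates: Seq[String]): Seq[Set[String]] = {
    val candidateSet = candidates.toSet
    slices.filter(sliceFiles => sliceFiles.exists(candidateSet.contains))
  }
}
{code}
Both variants return the same slices; only the asymptotic cost differs, since the hashed version visits each candidate name and each slice file a constant number of times.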
--
This message was sent by Atlassian Jira
(v8.20.10#820010)