Re: [PR] chore: add timing logs for file index partition and file listing [hudi]

via GitHub Thu, 09 Apr 2026 13:49:55 -0700


yihua commented on code in PR #18417:
URL: https://github.com/apache/hudi/pull/18417#discussion_r3060579505



##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -385,7 +394,13 @@ protected List<PartitionPath> 
listPartitionPaths(List<String> relativePartitionP
           HoodieTimeline timelineToQuery = findInstantsInRange();
           matchedPartitionPaths = 
TimelineUtils.getWrittenPartitions(timelineToQuery);
         } else {

Review Comment:
   🤖 The asymmetry seems reasonable to me — 
`TimelineUtils.getWrittenPartitions` just scans the already-loaded in-memory 
`HoodieTimeline` for committed instants and extracts partition paths from the 
instant metadata. It's a purely in-memory operation with no storage I/O, so 
it's extremely unlikely to be a performance bottleneck worth timing. The 
`getPartitionPathWithPathPrefixes` path, by contrast, hits the metadata table 
(or falls back to a filesystem listing), which is where latency typically lives.



##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -357,18 +358,26 @@ private Map<PartitionPath, List<FileSlice>> 
filterFiles(List<PartitionPath> part
                       .orElseGet(() -> 
finalFileSystemView.getLatestFileSlices(partitionPath.path))
                       .collect(Collectors.toList())
           ));
+    } finally {
+      log.debug("On {} with query instant as {}, it took {}ms to filter {} 
files into file slices across {} partitions",
+          metaClient.getTableConfig().getTableName(), 
queryInstant.orElse("N/A"), timer.endTimer(),
+          allFiles.size(), partitions.size());
     }
   }
 
   protected List<PartitionPath> listPartitionPaths(List<String> 
relativePartitionPaths,
                                                    Types.RecordType 
partitionFields,
                                                    Expression 
partitionColumnPredicates) {
     List<String> matchedPartitionPaths;
+    HoodieTimer timer = HoodieTimer.start();
     try {
       matchedPartitionPaths = 
tableMetadata.getPartitionPathWithPathPrefixUsingFilterExpression(relativePartitionPaths,
           partitionFields, partitionColumnPredicates);
     } catch (IOException e) {
       throw new HoodieIOException("Error fetching partition paths", e);
+    } finally {
+      log.debug("On {}, it took {} ms to list partition paths with {} 
relativePartitionPaths and partition predicates",

Review Comment:
   🤖 Worth being careful here — logging an `Expression` object via `toString()` 
on a hot path could be surprisingly expensive or produce very verbose output 
depending on the expression type. If the author does add it, it would be safer 
to wrap it with `if (log.isDebugEnabled())` rather than relying on the logger 
to short-circuit it, since argument evaluation in Java happens before the 
method call.



##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -476,6 +491,12 @@ private List<StoragePathInfo> 
listPartitionPathFiles(List<PartitionPath> partiti
     Set<StoragePath> missingPartitionPaths =
         CollectionUtils.diffSet(partitionPaths, cachedPartitionPaths.keySet());
 
+    if (missingPartitionPaths.isEmpty()) {
+      return cachedPartitionPaths.values().stream()
+          .flatMap(Collection::stream)
+          .collect(Collectors.toList());
+    }
+

Review Comment:
   🤖 The early return is functionally equivalent to the original code path. 
When `missingPartitionPaths` is empty, the original code would call 
`getAllFilesInPartitions` with an empty collection (returning an empty map), 
ingest nothing new into the cache, and assemble the result purely from 
`cachedPartitionPaths` — exactly what the early return yields. So this is a 
pure performance optimization with no behavioral difference, though the PR 
description's Impact section should indeed be updated to call it out explicitly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] chore: add timing logs for file index partition and file listing [hudi]

Reply via email to