[GitHub] [hudi] psendyk commented on a diff in pull request #6680: [HUDI-4812] Lazy fetching partition path & file slice for HoodieFileIndex

GitBox Tue, 15 Nov 2022 09:42:32 -0800


psendyk commented on code in PR #6680:
URL: https://github.com/apache/hudi/pull/6680#discussion_r1023090340



##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -179,17 +175,94 @@ public void close() throws Exception {
     resetTableMetadata(null);
   }
 
+  protected String[] getPartitionColumns() {
+    return partitionColumns;
+  }
+
+  protected List<Path> getQueryPaths() {
+    return queryPaths;
+  }
+
+  /**
+   * Returns all partition paths matching the ones explicitly provided by the 
query (if any)
+   */
   protected List<PartitionPath> getAllQueryPartitionPaths() {
-    List<String> queryRelativePartitionPaths = queryPaths.stream()
-        .map(path -> FSUtils.getRelativePartitionPath(basePath, path))
-        .collect(Collectors.toList());
+    if (cachedAllPartitionPaths == null) {
+      List<String> queryRelativePartitionPaths = queryPaths.stream()
+          .map(path -> FSUtils.getRelativePartitionPath(basePath, path))
+          .collect(Collectors.toList());
 
-    // Load all the partition path from the basePath, and filter by the query 
partition path.
-    // TODO load files from the queryRelativePartitionPaths directly.
-    List<String> matchedPartitionPaths = getAllPartitionPathsUnchecked()
-        .stream()
-        .filter(path -> 
queryRelativePartitionPaths.stream().anyMatch(path::startsWith))
-        .collect(Collectors.toList());
+      this.cachedAllPartitionPaths = 
listPartitionPaths(queryRelativePartitionPaths);
+    }
+
+    return cachedAllPartitionPaths;
+  }
+
+  /**
+   * Returns all listed file-slices w/in the partition paths returned by 
{@link #getAllQueryPartitionPaths()}
+   */
+  protected Map<PartitionPath, List<FileSlice>> getAllInputFileSlices() {
+    if (!areAllFileSlicesCached()) {
+      // Fetching file slices for partitions that have not been cached yet
+      List<PartitionPath> missingPartitions = 
getAllQueryPartitionPaths().stream()
+          .filter(p -> !cachedAllInputFileSlices.containsKey(p))
+          .collect(Collectors.toList());
+
+      // NOTE: Individual partitions are always cached in full, therefore if 
partition is cached
+      //       it will hold all the file-slices residing w/in the partition
+      
cachedAllInputFileSlices.putAll(loadFileSlicesForPartitions(missingPartitions));
+    }
+
+    return cachedAllInputFileSlices;
+  }
+
+  /**
+   * Get input file slice for the given partition. Will use cache directly if 
it is computed before.
+   */
+  protected List<FileSlice> getInputFileSlices(PartitionPath partition) {
+    return cachedAllInputFileSlices.computeIfAbsent(partition,
+        p -> loadFileSlicesForPartitions(Collections.singletonList(p)).get(p));
+  }
+
+  private Map<PartitionPath, List<FileSlice>> 
loadFileSlicesForPartitions(List<PartitionPath> partitions) {
+    Map<PartitionPath, FileStatus[]> partitionFiles = partitions.stream()
+        .collect(Collectors.toMap(p -> p, this::loadPartitionPathFiles));
+    HoodieTimeline activeTimeline = getActiveTimeline();
+    Option<HoodieInstant> latestInstant = activeTimeline.lastInstant();
+
+    FileStatus[] allFiles = 
partitionFiles.values().stream().flatMap(Arrays::stream).toArray(FileStatus[]::new);
+    HoodieTableFileSystemView fileSystemView = new 
HoodieTableFileSystemView(metaClient, activeTimeline, allFiles);

Review Comment:
   @YuweiXiao Please correct me if I'm wrong but I think that since 
`loadFileSlicesForPartitions` is called separately for each data partition 
after pruning, when using the MT, the FILES partition needs to be 
re-constructed from scratch when listing files in each data partition, one by 
one. This can become pretty expensive when listing multiple partitions as the 
merging of the base and delta files in the FILES partition is expensive. Would 
it be possible to either (a) list all pruned partitions in a single call, 
similarly to how the eager implementation (`doRefresh()`) does it, or (b) only 
initialize the `HoodieTableFileSystemView` once for listing all partitions in a 
single query? Anyways, this is a massive improvement to the query listing 
performance so thank you for implementing it, just thought I'd raise this as 
when I tested this branch in my env I actually got better listing perf with MT 
disabled. I can provide the driver logs if helpful where you can see the FILES 
partition be
 ing loaded when listing each data partition.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] psendyk commented on a diff in pull request #6680: [HUDI-4812] Lazy fetching partition path & file slice for HoodieFileIndex

Reply via email to