[GitHub] [hudi] YuweiXiao commented on a diff in pull request #6680: [HUDI-4812] Lazy fetching partition path & file slice for HoodieFileIndex

GitBox Wed, 16 Nov 2022 17:36:09 -0800


YuweiXiao commented on code in PR #6680:
URL: https://github.com/apache/hudi/pull/6680#discussion_r1024669448



##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -179,17 +175,94 @@ public void close() throws Exception {
     resetTableMetadata(null);
   }
 
+  protected String[] getPartitionColumns() {
+    return partitionColumns;
+  }
+
+  protected List<Path> getQueryPaths() {
+    return queryPaths;
+  }
+
+  /**
+   * Returns all partition paths matching the ones explicitly provided by the 
query (if any)
+   */
   protected List<PartitionPath> getAllQueryPartitionPaths() {
-    List<String> queryRelativePartitionPaths = queryPaths.stream()
-        .map(path -> FSUtils.getRelativePartitionPath(basePath, path))
-        .collect(Collectors.toList());
+    if (cachedAllPartitionPaths == null) {
+      List<String> queryRelativePartitionPaths = queryPaths.stream()
+          .map(path -> FSUtils.getRelativePartitionPath(basePath, path))
+          .collect(Collectors.toList());
 
-    // Load all the partition path from the basePath, and filter by the query 
partition path.
-    // TODO load files from the queryRelativePartitionPaths directly.
-    List<String> matchedPartitionPaths = getAllPartitionPathsUnchecked()
-        .stream()
-        .filter(path -> 
queryRelativePartitionPaths.stream().anyMatch(path::startsWith))
-        .collect(Collectors.toList());
+      this.cachedAllPartitionPaths = 
listPartitionPaths(queryRelativePartitionPaths);
+    }
+
+    return cachedAllPartitionPaths;
+  }
+
+  /**
+   * Returns all listed file-slices w/in the partition paths returned by 
{@link #getAllQueryPartitionPaths()}
+   */
+  protected Map<PartitionPath, List<FileSlice>> getAllInputFileSlices() {
+    if (!areAllFileSlicesCached()) {
+      // Fetching file slices for partitions that have not been cached yet
+      List<PartitionPath> missingPartitions = 
getAllQueryPartitionPaths().stream()
+          .filter(p -> !cachedAllInputFileSlices.containsKey(p))
+          .collect(Collectors.toList());
+
+      // NOTE: Individual partitions are always cached in full, therefore if 
partition is cached
+      //       it will hold all the file-slices residing w/in the partition
+      
cachedAllInputFileSlices.putAll(loadFileSlicesForPartitions(missingPartitions));
+    }
+
+    return cachedAllInputFileSlices;
+  }
+
+  /**
+   * Get input file slice for the given partition. Will use cache directly if 
it is computed before.
+   */
+  protected List<FileSlice> getInputFileSlices(PartitionPath partition) {
+    return cachedAllInputFileSlices.computeIfAbsent(partition,
+        p -> loadFileSlicesForPartitions(Collections.singletonList(p)).get(p));
+  }
+
+  private Map<PartitionPath, List<FileSlice>> 
loadFileSlicesForPartitions(List<PartitionPath> partitions) {
+    Map<PartitionPath, FileStatus[]> partitionFiles = partitions.stream()
+        .collect(Collectors.toMap(p -> p, this::loadPartitionPathFiles));
+    HoodieTimeline activeTimeline = getActiveTimeline();
+    Option<HoodieInstant> latestInstant = activeTimeline.lastInstant();
+
+    FileStatus[] allFiles = 
partitionFiles.values().stream().flatMap(Arrays::stream).toArray(FileStatus[]::new);
+    HoodieTableFileSystemView fileSystemView = new 
HoodieTableFileSystemView(metaClient, activeTimeline, allFiles);

Review Comment:
   Yeah, `partial lookup` should be the next step. The current algorithm is 
more like a `prefix lookup`. 
   
   Just to make sure I understand you correctly, `partial lookup` means 
exploiting the knowledge of the predicates during the partition listing. Taking 
a table with two partition columns (e.g., year & month) as an example. If the 
predicate is on the month column only, e.g., `month=03`, the current `prefix 
lookup` still have to do a full listing. While the `partial lookup` could do a 
single listing on base path, and then padding the result with `month=03`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] YuweiXiao commented on a diff in pull request #6680: [HUDI-4812] Lazy fetching partition path & file slice for HoodieFileIndex

Reply via email to