[GitHub] [hudi] psendyk commented on a diff in pull request #6680: [HUDI-4812] Lazy fetching partition path & file slice for HoodieFileIndex

GitBox Tue, 15 Nov 2022 18:43:58 -0800


psendyk commented on code in PR #6680:
URL: https://github.com/apache/hudi/pull/6680#discussion_r1023448955



##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -179,17 +175,94 @@ public void close() throws Exception {
     resetTableMetadata(null);
   }
 
+  protected String[] getPartitionColumns() {
+    return partitionColumns;
+  }
+
+  protected List<Path> getQueryPaths() {
+    return queryPaths;
+  }
+
+  /**
+   * Returns all partition paths matching the ones explicitly provided by the 
query (if any)
+   */
   protected List<PartitionPath> getAllQueryPartitionPaths() {
-    List<String> queryRelativePartitionPaths = queryPaths.stream()
-        .map(path -> FSUtils.getRelativePartitionPath(basePath, path))
-        .collect(Collectors.toList());
+    if (cachedAllPartitionPaths == null) {
+      List<String> queryRelativePartitionPaths = queryPaths.stream()
+          .map(path -> FSUtils.getRelativePartitionPath(basePath, path))
+          .collect(Collectors.toList());
 
-    // Load all the partition path from the basePath, and filter by the query 
partition path.
-    // TODO load files from the queryRelativePartitionPaths directly.
-    List<String> matchedPartitionPaths = getAllPartitionPathsUnchecked()
-        .stream()
-        .filter(path -> 
queryRelativePartitionPaths.stream().anyMatch(path::startsWith))
-        .collect(Collectors.toList());
+      this.cachedAllPartitionPaths = 
listPartitionPaths(queryRelativePartitionPaths);
+    }
+
+    return cachedAllPartitionPaths;
+  }
+
+  /**
+   * Returns all listed file-slices w/in the partition paths returned by 
{@link #getAllQueryPartitionPaths()}
+   */
+  protected Map<PartitionPath, List<FileSlice>> getAllInputFileSlices() {
+    if (!areAllFileSlicesCached()) {
+      // Fetching file slices for partitions that have not been cached yet
+      List<PartitionPath> missingPartitions = 
getAllQueryPartitionPaths().stream()
+          .filter(p -> !cachedAllInputFileSlices.containsKey(p))
+          .collect(Collectors.toList());
+
+      // NOTE: Individual partitions are always cached in full, therefore if 
partition is cached
+      //       it will hold all the file-slices residing w/in the partition
+      
cachedAllInputFileSlices.putAll(loadFileSlicesForPartitions(missingPartitions));
+    }
+
+    return cachedAllInputFileSlices;
+  }
+
+  /**
+   * Get input file slice for the given partition. Will use cache directly if 
it is computed before.
+   */
+  protected List<FileSlice> getInputFileSlices(PartitionPath partition) {
+    return cachedAllInputFileSlices.computeIfAbsent(partition,
+        p -> loadFileSlicesForPartitions(Collections.singletonList(p)).get(p));
+  }
+
+  private Map<PartitionPath, List<FileSlice>> 
loadFileSlicesForPartitions(List<PartitionPath> partitions) {
+    Map<PartitionPath, FileStatus[]> partitionFiles = partitions.stream()
+        .collect(Collectors.toMap(p -> p, this::loadPartitionPathFiles));
+    HoodieTimeline activeTimeline = getActiveTimeline();
+    Option<HoodieInstant> latestInstant = activeTimeline.lastInstant();
+
+    FileStatus[] allFiles = 
partitionFiles.values().stream().flatMap(Arrays::stream).toArray(FileStatus[]::new);
+    HoodieTableFileSystemView fileSystemView = new 
HoodieTableFileSystemView(metaClient, activeTimeline, allFiles);

Review Comment:
   Gotcha, that makes sense. So from the perspective of the file listing, the 
issue is the number of partial lookups, right? Out of curiosity, would it be 
possible to do the lookups on a less granular level? For instance, if a table 
is partitioned by 2 columns but the query only provides one equality predicate 
with a partitioning column, the algorithm in this PR will first fetch all leaf 
partitions that match the predicate and then list the files in all leaf 
partitions one by one. Instead, could you do a single partial lookup just on 
the column provided in the predicate? I'm not too familiar with the internals 
of the files index so this is just for my own curiosity.
   
   Thank you both for the explanation and for implementing this. I'm still 
excited for this patch to land, even with MT disabled. For context, I have a 
table with a few PBs of data partitioned by 2 columns for a total of ~1.5M 
partitions. On 0.12.0 with MT enabled, the file listing takes ~15min regardless 
of the query. Most of the queries have an equality predicate on one of the 
columns, and when I tested this patch yesterday the listing with MT disabled 
only takes ~30s. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [hudi] psendyk commented on a diff in pull request #6680: [HUDI-4812] Lazy fetching partition path & file slice for HoodieFileIndex

Reply via email to