psendyk commented on code in PR #6680:
URL: https://github.com/apache/hudi/pull/6680#discussion_r1023350011


##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -179,17 +175,94 @@ public void close() throws Exception {
     resetTableMetadata(null);
   }
 
+  protected String[] getPartitionColumns() {
+    return partitionColumns;
+  }
+
+  protected List<Path> getQueryPaths() {
+    return queryPaths;
+  }
+
+  /**
+   * Returns all partition paths matching the ones explicitly provided by the 
query (if any)
+   */
   protected List<PartitionPath> getAllQueryPartitionPaths() {
-    List<String> queryRelativePartitionPaths = queryPaths.stream()
-        .map(path -> FSUtils.getRelativePartitionPath(basePath, path))
-        .collect(Collectors.toList());
+    if (cachedAllPartitionPaths == null) {
+      List<String> queryRelativePartitionPaths = queryPaths.stream()
+          .map(path -> FSUtils.getRelativePartitionPath(basePath, path))
+          .collect(Collectors.toList());
 
-    // Load all the partition path from the basePath, and filter by the query 
partition path.
-    // TODO load files from the queryRelativePartitionPaths directly.
-    List<String> matchedPartitionPaths = getAllPartitionPathsUnchecked()
-        .stream()
-        .filter(path -> 
queryRelativePartitionPaths.stream().anyMatch(path::startsWith))
-        .collect(Collectors.toList());
+      this.cachedAllPartitionPaths = 
listPartitionPaths(queryRelativePartitionPaths);
+    }
+
+    return cachedAllPartitionPaths;
+  }
+
+  /**
+   * Returns all listed file-slices w/in the partition paths returned by 
{@link #getAllQueryPartitionPaths()}
+   */
+  protected Map<PartitionPath, List<FileSlice>> getAllInputFileSlices() {
+    if (!areAllFileSlicesCached()) {
+      // Fetching file slices for partitions that have not been cached yet
+      List<PartitionPath> missingPartitions = 
getAllQueryPartitionPaths().stream()
+          .filter(p -> !cachedAllInputFileSlices.containsKey(p))
+          .collect(Collectors.toList());
+
+      // NOTE: Individual partitions are always cached in full, therefore if 
partition is cached
+      //       it will hold all the file-slices residing w/in the partition
+      
cachedAllInputFileSlices.putAll(loadFileSlicesForPartitions(missingPartitions));
+    }
+
+    return cachedAllInputFileSlices;
+  }
+
+  /**
+   * Get input file slice for the given partition. Will use cache directly if 
it is computed before.
+   */
+  protected List<FileSlice> getInputFileSlices(PartitionPath partition) {
+    return cachedAllInputFileSlices.computeIfAbsent(partition,
+        p -> loadFileSlicesForPartitions(Collections.singletonList(p)).get(p));
+  }
+
+  private Map<PartitionPath, List<FileSlice>> 
loadFileSlicesForPartitions(List<PartitionPath> partitions) {
+    Map<PartitionPath, FileStatus[]> partitionFiles = partitions.stream()
+        .collect(Collectors.toMap(p -> p, this::loadPartitionPathFiles));
+    HoodieTimeline activeTimeline = getActiveTimeline();
+    Option<HoodieInstant> latestInstant = activeTimeline.lastInstant();
+
+    FileStatus[] allFiles = 
partitionFiles.values().stream().flatMap(Arrays::stream).toArray(FileStatus[]::new);
+    HoodieTableFileSystemView fileSystemView = new 
HoodieTableFileSystemView(metaClient, activeTimeline, allFiles);

Review Comment:
   Yes, you're right the `tableMetadata` is being reused but I'm not seeing how 
the caching of the file index works, would you mind pointing me to it? I spent 
some time in a debugger and it seems like this behavior is triggered when 
computing `allFiles` on L228-229, not initializing `HoodieTableFileSystemView` 
as I originally thought. This is just my poor attempt at debugging this with a 
very limited understanding of the codebase so apologies if I'm making some 
incorrect assumptions. I think it might be happening because 
`HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices` is being called 
with the `files` partition for each data partition listing operation, which 
then re-initializes the `HoodieTableFileSystemView` without any cached files. 
Then when `ensurePartitionLoadedCorrectly` is called with the `files` 
partition, `addedPartitions` is empty and the partition needs to be recomputed. 
There's also a chance something is wrong with my env but I'd suggest at least 
trying to re
 produce it in your env; if you can confirm the `files` partition isn't being 
recomputed for every data partition listing, that's great! Let me know if it 
works for you as expected and I'll move on to debugging my own env. In case you 
find it helpful, here's the stack trace from my debugger that I get for every 
data partition listing operation, where at the end 
(`ensurePartitionLoadedCorrectly`), the `files` partition is computed each time:
   ```
   AbstractTableFileSystemView.ensurePartitionLoadedCorrectly
   AbstractTableFileSystemView.getLatestMergedFileSlicesBeforeOrOn
   HoodieTableMetadataUtil.getPartitionFileSlices
   HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices
   HoodieBackedTableMetadata.getPartitionFileSliceToKeysMapping
   HoodieBackedTableMetadata.getRecordsByKeys
   HoodieBackedTableMetadata.getRecordByKey
   BaseTableMetadata.fetchAllFilesInPartition
   BaseTableMetadata.getAllFilesInPartition
   BaseHoodieTableFileIndex.loadPartitionPathFiles
   BaseHoodieTableFileIndex.loadFileSlicesForPartitions
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to