psendyk commented on code in PR #6680:
URL: https://github.com/apache/hudi/pull/6680#discussion_r1023403989
##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -179,17 +175,94 @@ public void close() throws Exception {
resetTableMetadata(null);
}
+ protected String[] getPartitionColumns() {
+ return partitionColumns;
+ }
+
+ protected List<Path> getQueryPaths() {
+ return queryPaths;
+ }
+
+ /**
+ * Returns all partition paths matching the ones explicitly provided by the query (if any)
+ */
protected List<PartitionPath> getAllQueryPartitionPaths() {
- List<String> queryRelativePartitionPaths = queryPaths.stream()
- .map(path -> FSUtils.getRelativePartitionPath(basePath, path))
- .collect(Collectors.toList());
+ if (cachedAllPartitionPaths == null) {
+ List<String> queryRelativePartitionPaths = queryPaths.stream()
+ .map(path -> FSUtils.getRelativePartitionPath(basePath, path))
+ .collect(Collectors.toList());
- // Load all the partition path from the basePath, and filter by the query partition path.
- // TODO load files from the queryRelativePartitionPaths directly.
- List<String> matchedPartitionPaths = getAllPartitionPathsUnchecked()
- .stream()
- .filter(path -> queryRelativePartitionPaths.stream().anyMatch(path::startsWith))
- .collect(Collectors.toList());
+ this.cachedAllPartitionPaths = listPartitionPaths(queryRelativePartitionPaths);
+ }
+
+ return cachedAllPartitionPaths;
+ }
+
+ /**
+ * Returns all listed file-slices w/in the partition paths returned by {@link #getAllQueryPartitionPaths()}
+ */
+ protected Map<PartitionPath, List<FileSlice>> getAllInputFileSlices() {
+ if (!areAllFileSlicesCached()) {
+ // Fetching file slices for partitions that have not been cached yet
+ List<PartitionPath> missingPartitions = getAllQueryPartitionPaths().stream()
+ .filter(p -> !cachedAllInputFileSlices.containsKey(p))
+ .collect(Collectors.toList());
+
+ // NOTE: Individual partitions are always cached in full, therefore if partition is cached
+ // it will hold all the file-slices residing w/in the partition
+ cachedAllInputFileSlices.putAll(loadFileSlicesForPartitions(missingPartitions));
+ }
+
+ return cachedAllInputFileSlices;
+ }
+
+ /**
+ * Get input file slice for the given partition. Will use cache directly if it is computed before.
+ */
+ protected List<FileSlice> getInputFileSlices(PartitionPath partition) {
+ return cachedAllInputFileSlices.computeIfAbsent(partition,
+ p -> loadFileSlicesForPartitions(Collections.singletonList(p)).get(p));
+ }
+
+ private Map<PartitionPath, List<FileSlice>> loadFileSlicesForPartitions(List<PartitionPath> partitions) {
+ Map<PartitionPath, FileStatus[]> partitionFiles = partitions.stream()
+ .collect(Collectors.toMap(p -> p, this::loadPartitionPathFiles));
+ HoodieTimeline activeTimeline = getActiveTimeline();
+ Option<HoodieInstant> latestInstant = activeTimeline.lastInstant();
+
+ FileStatus[] allFiles = partitionFiles.values().stream().flatMap(Arrays::stream).toArray(FileStatus[]::new);
+ HoodieTableFileSystemView fileSystemView = new HoodieTableFileSystemView(metaClient, activeTimeline, allFiles);
Review Comment:
Yeah, that makes sense @YuweiXiao, I see how the caching works here. I've
corrected my initial statement above: I believe this behavior is actually
triggered by calling `loadPartitionPathFiles` on L229 for each partition
individually, i.e. before the `HoodieTableFileSystemView` is initialized with
the cached files on L234. If you go down the call chain starting from L229,
you'll find that in `BaseTableMetadata.fetchAllFilesInPartition`, the call to
`getRecordByKey` on L329 results in a call to `ensurePartitionLoadedCorrectly`
with the `files` partition. However, along the way, in
`HoodieTableMetadataUtil.getPartitionFileSlices`, the filesystem view of the MT
(metadata table) is re-initialized without any cached files. I believe this
forces the `files` partition to be recomputed each time, because every time the
filesystem view is initialized, `partitionToFileGroupsMap` is initialized as
well. I see that `HoodieTableFileSystemView` populates
`partitionToFileGroupsMap` from the provided files if you use the constructor
on L177, but not if you use the one on L97, which is what
`HoodieTableMetadataUtil.getPartitionFileSlices` uses. I'm not trying to derail
this PR, so please feel free to ignore this if you don't see the `files`
partition being recomputed in your env.
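
To make the concern concrete, here is a minimal, self-contained sketch of the pattern I think is at play. It does not use any Hudi classes: `ViewCacheSketch`, `View`, `listFiles`, and the counter are all hypothetical stand-ins, and the only assumption is that a view built without pre-listed files has to list a partition on first access.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Hudi.
public class ViewCacheSketch {
    // Counts simulated (expensive) listings so the cost is observable.
    static int listingCalls = 0;

    // Stand-in for an expensive listing of a partition.
    static List<String> listFiles(String partition) {
        listingCalls++;
        List<String> files = new ArrayList<>();
        files.add(partition + "/file-1");
        return files;
    }

    // Stand-in for a file-system view with a per-instance partition map.
    static class View {
        private final Map<String, List<String>> partitionToFiles = new HashMap<>();

        // Analogous to a constructor that receives no pre-listed files.
        View() {}

        // Analogous to a constructor that is seeded with already-listed files.
        View(Map<String, List<String>> preListed) {
            partitionToFiles.putAll(preListed);
        }

        // Lists the partition only if this instance has not seen it yet.
        List<String> filesIn(String partition) {
            return partitionToFiles.computeIfAbsent(partition, ViewCacheSketch::listFiles);
        }
    }

    // Re-creates the view on every lookup, so the per-instance map never helps.
    static int lookupsWithFreshView(int n) {
        listingCalls = 0;
        for (int i = 0; i < n; i++) {
            new View().filesIn("files");
        }
        return listingCalls;
    }

    // Reuses one view, so the partition is listed only once.
    static int lookupsWithSharedView(int n) {
        listingCalls = 0;
        View view = new View();
        for (int i = 0; i < n; i++) {
            view.filesIn("files");
        }
        return listingCalls;
    }

    public static void main(String[] args) {
        System.out.println("fresh view per lookup: " + lookupsWithFreshView(3));
        System.out.println("shared view:           " + lookupsWithSharedView(3));
    }
}
```

If the real code path behaves like `lookupsWithFreshView`, the listing cost grows with the number of partitions looked up, which would match what I'm seeing; reusing the view or seeding it with already-listed files avoids the repeated work in this sketch.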