psendyk commented on code in PR #6680:
URL: https://github.com/apache/hudi/pull/6680#discussion_r1023448955
##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -179,17 +175,94 @@ public void close() throws Exception {
resetTableMetadata(null);
}
+ protected String[] getPartitionColumns() {
+ return partitionColumns;
+ }
+
+ protected List<Path> getQueryPaths() {
+ return queryPaths;
+ }
+
+ /**
+ * Returns all partition paths matching the ones explicitly provided by the
query (if any)
+ */
protected List<PartitionPath> getAllQueryPartitionPaths() {
- List<String> queryRelativePartitionPaths = queryPaths.stream()
- .map(path -> FSUtils.getRelativePartitionPath(basePath, path))
- .collect(Collectors.toList());
+ if (cachedAllPartitionPaths == null) {
+ List<String> queryRelativePartitionPaths = queryPaths.stream()
+ .map(path -> FSUtils.getRelativePartitionPath(basePath, path))
+ .collect(Collectors.toList());
- // Load all the partition path from the basePath, and filter by the query
partition path.
- // TODO load files from the queryRelativePartitionPaths directly.
- List<String> matchedPartitionPaths = getAllPartitionPathsUnchecked()
- .stream()
- .filter(path ->
queryRelativePartitionPaths.stream().anyMatch(path::startsWith))
- .collect(Collectors.toList());
+ this.cachedAllPartitionPaths =
listPartitionPaths(queryRelativePartitionPaths);
+ }
+
+ return cachedAllPartitionPaths;
+ }
+
+ /**
+ * Returns all listed file-slices w/in the partition paths returned by
{@link #getAllQueryPartitionPaths()}
+ */
+ protected Map<PartitionPath, List<FileSlice>> getAllInputFileSlices() {
+ if (!areAllFileSlicesCached()) {
+ // Fetching file slices for partitions that have not been cached yet
+ List<PartitionPath> missingPartitions =
getAllQueryPartitionPaths().stream()
+ .filter(p -> !cachedAllInputFileSlices.containsKey(p))
+ .collect(Collectors.toList());
+
+ // NOTE: Individual partitions are always cached in full, therefore if
partition is cached
+ // it will hold all the file-slices residing w/in the partition
+
cachedAllInputFileSlices.putAll(loadFileSlicesForPartitions(missingPartitions));
+ }
+
+ return cachedAllInputFileSlices;
+ }
+
+ /**
+ * Get input file slice for the given partition. Will use cache directly if
it is computed before.
+ */
+ protected List<FileSlice> getInputFileSlices(PartitionPath partition) {
+ return cachedAllInputFileSlices.computeIfAbsent(partition,
+ p -> loadFileSlicesForPartitions(Collections.singletonList(p)).get(p));
+ }
+
+ private Map<PartitionPath, List<FileSlice>>
loadFileSlicesForPartitions(List<PartitionPath> partitions) {
+ Map<PartitionPath, FileStatus[]> partitionFiles = partitions.stream()
+ .collect(Collectors.toMap(p -> p, this::loadPartitionPathFiles));
+ HoodieTimeline activeTimeline = getActiveTimeline();
+ Option<HoodieInstant> latestInstant = activeTimeline.lastInstant();
+
+ FileStatus[] allFiles =
partitionFiles.values().stream().flatMap(Arrays::stream).toArray(FileStatus[]::new);
+ HoodieTableFileSystemView fileSystemView = new
HoodieTableFileSystemView(metaClient, activeTimeline, allFiles);
Review Comment:
Gotcha, that makes sense. So from the perspective of the file listing, the
issue is the number of partial lookups, right? Out of curiosity, would it be
possible to do the lookups on a less granular level? For instance, if a table
is partitioned by 2 columns but the query only provides one equality predicate
with a partitioning column, the algorithm in this PR will first fetch all leaf
partitions that match the predicate and then list the files in all leaf
partitions one by one. Instead, could you do a single partial lookup just on
the column provided in the predicate? I'm not too familiar with the internals
of the files index so this is just for my own curiosity.
Thank you both for the explanation and for implementing this. I'm still
excited for this patch to land, even with MT disabled. For context, I have a
table with a few PBs of data partitioned by 2 columns for a total of ~1.5M
partitions. On 0.12.0 with MT enabled, the file listing takes ~15min regardless
of the query. Most of the queries have an equality predicate on one of the
columns, and when I tested this patch yesterday the listing with MT disabled
only takes ~30s.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]