hudi-agent commented on code in PR #19045:
URL: https://github.com/apache/hudi/pull/19045#discussion_r3464757996
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##########
@@ -268,6 +275,37 @@ private void init(HoodieRecord record) {
doInit = false;
}
+ /**
+ * Returns true when this append handle is writing to a layout sub-path of
an MDT partition
+ * (e.g. {@code record_index/0004} under sub-directory bucketing). In that
case, the
+ * {@code .hoodie_partition_metadata} marker must NOT be created at the
bucket level; it is
+ * written once at the logical partition root by the MDT initialization path.
+ *
+ * <p>Heuristic: an MDT partition path of the form {@code
<known-mdt-partition>/<NNNN>} where
+ * {@code NNNN} is the standard 4-digit bucket name produced by the
sub-directory bucketing
+ * layout. Third-party layouts using a different sub-path naming scheme can
ship their own
+ * append-handle integration; the OSS-shipped layouts use this convention.
+ */
+ private boolean isMDTLayoutSubPath(String physicalPartitionPath) {
Review Comment:
🤖 This couples the engine-agnostic append handle to one layout's
directory-naming convention through a string heuristic. Two concerns: (1)
brittleness — `SubDirBucketedMDTLayout.bucketRelativePath` uses `%04d`, so once
the bucket index reaches 10000 the suffix becomes 5 digits,
`isMDTLayoutSubPath` returns false, and a `.hoodie_partition_metadata` marker
gets written inside the bucket dir — which would then surface
`record_index/10000` as a logical partition and break the central invariant;
(2) layering — could the marker-skip decision be driven by the resolved
`HoodieMetadataTableLayout` (it already knows its scheme via
`getPartitionMarkerPaths`) instead of re-deriving "is this a sub-path"
heuristically here?
<sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/metadata/LayoutContext.java:
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import org.apache.hudi.common.util.Option;
+
+import java.io.Serializable;
+
+/**
+ * Per-file-group context handed to a {@link HoodieMetadataTableLayout}.
+ *
+ * <p>Carries everything a layout needs to compute the on-disk relative path
+ * and fileId for a single file group: the metadata partition type, the global
+ * file-group index, the total number of file groups in that partition, and
+ * the data-table partition name (set only for partitioned RLI).
+ */
+public final class LayoutContext implements Serializable {
Review Comment:
🤖 nit: `LayoutContext` is quite generic — could this be
`HoodieMetadataLayoutContext` to align with the `Hoodie*` prefix convention
used by other public classes in the metadata package (e.g.
`HoodieMetadataTableLayout`, `HoodieMetadataTableLayouts`)? It's part of the
layout SPI so external implementors will reference it directly.
<sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -1514,27 +1514,52 @@ private static List<FileSlice>
getPartitionFileSlices(HoodieTableMetaClient meta
HoodieTableFileSystemView fsView = null;
try {
fsView = fileSystemView.orElseGet(() ->
getFileSystemViewForMetadataTable(metaClient));
- Stream<FileSlice> fileSliceStream;
- if (mergeFileSlices) {
- if
(metaClient.getActiveTimeline().filterCompletedInstants().lastInstant().isPresent())
{
- fileSliceStream = fsView.getLatestMergedFileSlicesBeforeOrOn(
- // including pending compaction instant as the last instant so
that the finished delta commits
- // that start earlier than the compaction can be queried.
- partition,
metaClient.getActiveTimeline().filterCompletedAndCompactionInstants().lastInstant().get().requestedTime());
+ List<String> physicalPartitions = resolvePhysicalPartitions(metaClient,
partition);
+ List<FileSlice> all = new ArrayList<>();
+ boolean any = false;
Review Comment:
🤖 nit: the `any` flag just tracks whether the loop ran at all — an `if
(physicalPartitions.isEmpty()) return Collections.emptyList();` guard before
the loop would express the same intent more directly and remove the flag
entirely.
<sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/metadata/FileIdInfo.java:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import org.apache.hudi.common.util.Option;
+
+import java.io.Serializable;
+
+/**
+ * Parsed representation of an MDT fileId, produced by
+ * {@link HoodieMetadataTableLayout#parseFileId}.
+ */
+public final class FileIdInfo implements Serializable {
Review Comment:
🤖 nit: `FileIdInfo` reads as a very plain data-holder name —
`HoodieMetadataFileIdInfo` (or `HoodieMetadataLayoutFileIdInfo`) would keep it
consistent with the `Hoodie*` convention and make it clear this is part of the
layout SPI rather than a generic utility record.
<sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -2081,6 +2192,14 @@ private SerializableFunction<HoodieRecord, HoodieRecord>
getRecordTagger(String
FileSlice slice = fileSlices.get(mappingFunction.apply(r.getRecordKey(),
fileGroupCount));
r.unseal();
r.setCurrentLocation(new
HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
+ // Under layouts that place file groups in sub-directories (e.g.
bucketing), the file slice's
+ // physical partition path may differ from the logical MDT partition
name carried on the
+ // record. Realign the record's partition path with where the file slice
actually lives so
+ // downstream HoodieAppendHandle's partition-path consistency check
passes.
+ String physical = slice.getPartitionPath();
Review Comment:
🤖 Realigning the record to the physical bucket path makes MDT delta-commit
write stats record `record_index/0004`, which incremental compaction/cleaning
can use. But the full-listing fallbacks don't fan out:
`BaseTableServicePlanActionExecutor.getPartitions()` returns logical names via
`FSUtils.getAllPartitionPaths`, then `BaseHoodieCompactionPlanGenerator` calls
`fileSystemView.getLatestFileSlicesStateless(<logical>)`; cleaning's
`getPartitionPathsForFullCleaning()` does the same via
`getAllPartitionPaths()`. Neither sees file groups under bucket sub-paths. Have
you verified compaction and cleaning actually process the bucketed RLI file
groups? The test here only does 3 commits, so it likely never triggers them —
and uncompacted log growth would defeat the scaling goal. @yihua might want to
weigh in.
<sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -349,15 +349,27 @@ private Map<PartitionPath, List<FileSlice>>
filterFiles(List<PartitionPath> part
// API. Note that for COW table, the merging logic of two slices does
not happen as there
// is no compaction, thus there is no performance impact.
HoodieTableFileSystemView finalFileSystemView = fileSystemView;
+ // For an MDT under a non-flat layout, the FS view indexes file slices
by their physical
+ // partition (bucket sub-paths), but the partition list here uses
logical partition names.
+ // Resolve via HoodieTableMetadataUtil which fans out across the
layout's physical sub-paths.
+ final boolean isMdt = HoodieTableMetadata.isMetadataTable(basePath);
return partitions.stream().collect(
Collectors.toMap(
Function.identity(),
- partitionPath ->
- queryInstant.map(instant ->
-
finalFileSystemView.getLatestMergedFileSlicesBeforeOrOn(partitionPath.path,
queryInstant.get())
- )
- .orElseGet(() ->
finalFileSystemView.getLatestFileSlices(partitionPath.path))
- .collect(Collectors.toList())
+ partitionPath -> {
+ if (isMdt) {
+ return queryInstant.isPresent()
Review Comment:
🤖 For an MDT time-travel query (`queryInstant` present), this routes to
`getPartitionLatestMergedFileSlices`, which internally merges against the
timeline's last completed (and compaction) instant rather than `queryInstant`.
The previous code passed `queryInstant.get()` to
`getLatestMergedFileSlicesBeforeOrOn`. Since `isMdt` is true for any MDT —
including the flat default — this silently changes time-travel-on-MDT semantics
to "always latest" even for tables that didn't opt into a layout. Is that
intended? If time-travel on the MDT path matters, it may be worth threading
`queryInstant` through a layout-aware merged-slice helper.
<sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag
quality.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]