hudi-agent commented on code in PR #19045:
URL: https://github.com/apache/hudi/pull/19045#discussion_r3464757996


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##########
@@ -268,6 +275,37 @@ private void init(HoodieRecord record) {
     doInit = false;
   }
 
+  /**
+   * Returns true when this append handle is writing to a layout sub-path of 
an MDT partition
+   * (e.g. {@code record_index/0004} under sub-directory bucketing). In that 
case, the
+   * {@code .hoodie_partition_metadata} marker must NOT be created at the 
bucket level; it is
+   * written once at the logical partition root by the MDT initialization path.
+   *
+   * <p>Heuristic: an MDT partition path of the form {@code 
<known-mdt-partition>/<NNNN>} where
+   * {@code NNNN} is the standard 4-digit bucket name produced by the 
sub-directory bucketing
+   * layout. Third-party layouts using a different sub-path naming scheme can 
ship their own
+   * append-handle integration; the OSS-shipped layouts use this convention.
+   */
+  private boolean isMDTLayoutSubPath(String physicalPartitionPath) {

Review Comment:
   🤖 This couples the engine-agnostic append handle to one layout's 
directory-naming convention through a string heuristic. Two concerns: (1) 
brittleness — `SubDirBucketedMDTLayout.bucketRelativePath` uses `%04d`, so once 
the bucket index reaches 10000 the suffix becomes 5 digits, 
`isMDTLayoutSubPath` returns false, and a `.hoodie_partition_metadata` marker 
gets written inside the bucket dir — which would then surface 
`record_index/10000` as a logical partition and break the central invariant; 
(2) layering — could the marker-skip decision be driven by the resolved 
`HoodieMetadataTableLayout` (it already knows its scheme via 
`getPartitionMarkerPaths`) instead of re-deriving "is this a sub-path" 
heuristically here?
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/metadata/LayoutContext.java:
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import org.apache.hudi.common.util.Option;
+
+import java.io.Serializable;
+
+/**
+ * Per-file-group context handed to a {@link HoodieMetadataTableLayout}.
+ *
+ * <p>Carries everything a layout needs to compute the on-disk relative path
+ * and fileId for a single file group: the metadata partition type, the global
+ * file-group index, the total number of file groups in that partition, and
+ * the data-table partition name (set only for partitioned RLI).
+ */
+public final class LayoutContext implements Serializable {

Review Comment:
   🤖 nit: `LayoutContext` is quite generic — could this be 
`HoodieMetadataLayoutContext` to align with the `Hoodie*` prefix convention 
used by other public classes in the metadata package (e.g. 
`HoodieMetadataTableLayout`, `HoodieMetadataTableLayouts`)? It's part of the 
layout SPI so external implementors will reference it directly.
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -1514,27 +1514,52 @@ private static List<FileSlice> 
getPartitionFileSlices(HoodieTableMetaClient meta
     HoodieTableFileSystemView fsView = null;
     try {
       fsView = fileSystemView.orElseGet(() -> 
getFileSystemViewForMetadataTable(metaClient));
-      Stream<FileSlice> fileSliceStream;
-      if (mergeFileSlices) {
-        if 
(metaClient.getActiveTimeline().filterCompletedInstants().lastInstant().isPresent())
 {
-          fileSliceStream = fsView.getLatestMergedFileSlicesBeforeOrOn(
-              // including pending compaction instant as the last instant so 
that the finished delta commits
-              // that start earlier than the compaction can be queried.
-              partition, 
metaClient.getActiveTimeline().filterCompletedAndCompactionInstants().lastInstant().get().requestedTime());
+      List<String> physicalPartitions = resolvePhysicalPartitions(metaClient, 
partition);
+      List<FileSlice> all = new ArrayList<>();
+      boolean any = false;

Review Comment:
   🤖 nit: the `any` flag just tracks whether the loop ran at all — an `if 
(physicalPartitions.isEmpty()) return Collections.emptyList();` guard before 
the loop would express the same intent more directly and remove the flag 
entirely.
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/metadata/FileIdInfo.java:
##########
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import org.apache.hudi.common.util.Option;
+
+import java.io.Serializable;
+
+/**
+ * Parsed representation of an MDT fileId, produced by
+ * {@link HoodieMetadataTableLayout#parseFileId}.
+ */
+public final class FileIdInfo implements Serializable {

Review Comment:
   🤖 nit: `FileIdInfo` reads as a very plain data-holder name — 
`HoodieMetadataFileIdInfo` (or `HoodieMetadataLayoutFileIdInfo`) would keep it 
consistent with the `Hoodie*` convention and make it clear this is part of the 
layout SPI rather than a generic utility record.
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -2081,6 +2192,14 @@ private SerializableFunction<HoodieRecord, HoodieRecord> 
getRecordTagger(String
       FileSlice slice = fileSlices.get(mappingFunction.apply(r.getRecordKey(), 
fileGroupCount));
       r.unseal();
       r.setCurrentLocation(new 
HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
+      // Under layouts that place file groups in sub-directories (e.g. 
bucketing), the file slice's
+      // physical partition path may differ from the logical MDT partition 
name carried on the
+      // record. Realign the record's partition path with where the file slice 
actually lives so
+      // downstream HoodieAppendHandle's partition-path consistency check 
passes.
+      String physical = slice.getPartitionPath();

Review Comment:
   🤖 Realigning the record to the physical bucket path makes MDT delta-commit 
write stats record `record_index/0004`, which incremental compaction/cleaning 
can use. But the full-listing fallbacks don't fan out: 
`BaseTableServicePlanActionExecutor.getPartitions()` returns logical names via 
`FSUtils.getAllPartitionPaths`, then `BaseHoodieCompactionPlanGenerator` calls 
`fileSystemView.getLatestFileSlicesStateless(<logical>)`; cleaning's 
`getPartitionPathsForFullCleaning()` does the same via 
`getAllPartitionPaths()`. Neither sees file groups under bucket sub-paths. Have 
you verified compaction and cleaning actually process the bucketed RLI file 
groups? The test here only does 3 commits, so it likely never triggers them — 
and uncompacted log growth would defeat the scaling goal. @yihua might want to 
weigh in.
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -349,15 +349,27 @@ private Map<PartitionPath, List<FileSlice>> 
filterFiles(List<PartitionPath> part
       // API.  Note that for COW table, the merging logic of two slices does 
not happen as there
       // is no compaction, thus there is no performance impact.
       HoodieTableFileSystemView finalFileSystemView = fileSystemView;
+      // For an MDT under a non-flat layout, the FS view indexes file slices 
by their physical
+      // partition (bucket sub-paths), but the partition list here uses 
logical partition names.
+      // Resolve via HoodieTableMetadataUtil which fans out across the 
layout's physical sub-paths.
+      final boolean isMdt = HoodieTableMetadata.isMetadataTable(basePath);
       return partitions.stream().collect(
           Collectors.toMap(
               Function.identity(),
-              partitionPath ->
-                  queryInstant.map(instant ->
-                          
finalFileSystemView.getLatestMergedFileSlicesBeforeOrOn(partitionPath.path, 
queryInstant.get())
-                      )
-                      .orElseGet(() -> 
finalFileSystemView.getLatestFileSlices(partitionPath.path))
-                      .collect(Collectors.toList())
+              partitionPath -> {
+                if (isMdt) {
+                  return queryInstant.isPresent()

Review Comment:
   🤖 For an MDT time-travel query (`queryInstant` present), this routes to 
`getPartitionLatestMergedFileSlices`, which internally merges against the 
timeline's last completed (and compaction) instant rather than `queryInstant`. 
The previous code passed `queryInstant.get()` to 
`getLatestMergedFileSlicesBeforeOrOn`. Since `isMdt` is true for any MDT — 
including the flat default — this silently changes time-travel-on-MDT semantics 
to "always latest" even for tables that didn't opt into a layout. Is that 
intended? If time-travel on the MDT path matters, it may be worth threading 
`queryInstant` through a layout-aware merged-slice helper.
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to