nsivabalan commented on code in PR #19045:
URL: https://github.com/apache/hudi/pull/19045#discussion_r3498253089


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -2081,6 +2192,14 @@ private SerializableFunction<HoodieRecord, HoodieRecord> 
getRecordTagger(String
       FileSlice slice = fileSlices.get(mappingFunction.apply(r.getRecordKey(), 
fileGroupCount));
       r.unseal();
       r.setCurrentLocation(new 
HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
+      // Under layouts that place file groups in sub-directories (e.g. 
bucketing), the file slice's
+      // physical partition path may differ from the logical MDT partition 
name carried on the
+      // record. Realign the record's partition path with where the file slice 
actually lives so
+      // downstream HoodieAppendHandle's partition-path consistency check 
passes.
+      String physical = slice.getPartitionPath();

Review Comment:
   Acknowledged — this is one of the two non-trivial items we're deferring to a 
follow-up. Compaction (`BaseHoodieCompactionPlanGenerator`) and cleaning's 
full-listing path go through the normal FS view under the logical partition 
name, which under bucketing will return zero file slices. cshuo flagged the 
same in https://github.com/apache/hudi/pull/19045#discussion_r3492391550. 
Tracking: the realignment here makes the per-record write stats land at the 
bucket sub-path so an incremental-stats-driven plan can find them, but the 
listing-driven fallback paths still need a layout-aware fan-out. Will be 
addressed in the bucketing-table-services follow-up alongside the 
partitioned-RLI work.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##########
@@ -268,6 +275,37 @@ private void init(HoodieRecord record) {
     doInit = false;
   }
 
+  /**
+   * Returns true when this append handle is writing to a layout sub-path of 
an MDT partition
+   * (e.g. {@code record_index/0004} under sub-directory bucketing). In that 
case, the
+   * {@code .hoodie_partition_metadata} marker must NOT be created at the 
bucket level; it is
+   * written once at the logical partition root by the MDT initialization path.
+   *
+   * <p>Heuristic: an MDT partition path of the form {@code 
<known-mdt-partition>/<NNNN>} where
+   * {@code NNNN} is the standard 4-digit bucket name produced by the 
sub-directory bucketing
+   * layout. Third-party layouts using a different sub-path naming scheme can 
ship their own
+   * append-handle integration; the OSS-shipped layouts use this convention.
+   */
+  private boolean isMDTLayoutSubPath(String physicalPartitionPath) {

Review Comment:
   Partially addressed and partially deferred.
   
   Fixed in `fcf090641055` + `8a7c60f5c54f`:
   - Layering: moved out of the engine-agnostic append handle into 
`HoodieTableMetadataUtil.isMDTBucketSubPath`; the AppendHandle is now a 3-line 
delegate.
   - Spurious matches: the heuristic now hard-gates on a non-flat MDT layout 
being configured (`getMetadataLayoutClass()` present and != `FlatMDTLayout`), 
so a 4-digit data-table partition value (e.g. `price=1000`) on a flat-default 
table cannot trigger a false positive.
   
   Still pending: the bucket-index >= 10000 case still falls back to false 
(5-digit suffix), which would let a marker be written inside a bucket dir at 
scale. Mitigation in code: `SubDirBucketedMDTLayout.bucketRelativePath` uses 
`%04d` so any bucket index that overflows 4 digits already produces a path the 
heuristic doesn't recognize. Real fix: have the layout itself answer 
`isLayoutSubPath(path)` via the SPI rather than relying on a string heuristic. 
Deferring to the same follow-up that addresses the compaction/cleaning fan-out, 
since both want the layout to be the source of truth for sub-path 
identification.



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########


Review Comment:
   Acknowledged — same root cause as the hudi-agent comment at 
https://github.com/apache/hudi/pull/19045#discussion_r3464757989. The 
realignment in `getRecordTagger` makes incremental-stats-driven plans usable, 
but compaction's 
`BaseHoodieCompactionPlanGenerator.getLatestFileSlicesStateless(logical)` and 
cleaning's `getPartitionPathsForFullCleaning()` still go through the 
file-system view under logical names with no layout-aware fan-out. Deferring to 
a bucketing-table-services follow-up patch; happy to chat on the design before 
that lands.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to