nsivabalan opened a new pull request, #19045:
URL: https://github.com/apache/hudi/pull/19045

   ### Change Logs
   
   This PR introduces a pluggable layout SPI for the Hudi Metadata Table (MDT) 
so the on-disk organization of file groups can be customized without forking 
the writer or reader paths. It ships two implementations in OSS:
   
   - **`FlatMDTLayout`** (default) — today's behavior, bit-for-bit. Every file 
group lives directly under its MDT partition directory.
   - **`SubDirBucketedMDTLayout`** (opt-in) — distributes file groups into 
4-digit bucket sub-directories so large MDT partitions do not exceed 
per-directory file-count limits on HDFS-style filesystems.
   
   Tables that don't opt in keep the same on-disk layout and table properties 
as before, with no migration required and no regressions in the default path.
   
   Closes #18178 (proposal in 
https://github.com/apache/hudi/issues/18178#issuecomment-4770730271).
   
   ---
   
   ### Motivation
   
   Large-scale Hudi tables can accumulate many file groups in a single MDT 
partition. With ~15K file groups, for example, all of the HFiles and log files 
land directly in `.hoodie/metadata/record_index/`, which can exceed 
per-directory file-count limits on HDFS and degrade listing performance. The 
same scaling pressure affects other MDT partitions (column_stats, 
bloom_filters, expression/secondary indexes) for sufficiently wide tables.
   
   A community user surfaced this scaling concern; this PR is one way to 
address it while keeping the existing layout untouched for users who don't opt 
in, and leaving room for third parties to ship their own layout implementations 
via the SPI.
   
   ---
   
   ### Design
   
   #### SPI
   
   ```java
   public interface HoodieMetadataTableLayout extends Serializable {
     String getLayoutId();
     String getFileGroupRelativePath(LayoutContext ctx);
     String getFileId(LayoutContext ctx);
     FileIdInfo parseFileId(MetadataPartitionType type, String fileId);
     List<String> getPhysicalPartitions(String logicalPartition, int 
fileGroupCount);
     List<String> getPartitionMarkerPaths(String logicalPartition, int 
fileGroupCount);
   }
   ```
   
   `LayoutContext` carries `(MetadataPartitionType, fileGroupIndex, 
fileGroupCount, Option<dataPartitionName>)`. Layout instances are constructed 
once per MDT via a factory that resolves `hoodie.metadata.layout.class` from 
the **MDT's own `hoodie.properties`** (single source of truth — not duplicated 
on the data table).
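   
   The resolution step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's factory: `LayoutFactorySketch`, the nested `Layout` interface, the two stand-in implementations, and the layout IDs `"flat"` and `"bucketed"` are all hypothetical. Only the property key `hoodie.metadata.layout.class` and the "unset means flat" default come from this description.
   
   ```java
   import java.util.Properties;

   final class LayoutFactorySketch {
     // Property key taken from the PR description.
     static final String LAYOUT_CLASS_KEY = "hoodie.metadata.layout.class";

     /** Hypothetical stand-in for HoodieMetadataTableLayout; only the ID accessor is modeled. */
     interface Layout {
       String getLayoutId();
     }

     /** Hypothetical stand-in for the default flat layout. */
     static final class FlatLayout implements Layout {
       @Override public String getLayoutId() { return "flat"; }
     }

     /** Hypothetical stand-in for an opt-in layout. */
     static final class BucketedLayout implements Layout {
       @Override public String getLayoutId() { return "bucketed"; }
     }

     /** Resolves the layout from the MDT's properties, falling back to flat when the key is unset. */
     static Layout resolve(Properties mdtProps) {
       String className = mdtProps.getProperty(LAYOUT_CLASS_KEY);
       if (className == null || className.trim().isEmpty()) {
         return new FlatLayout();
       }
       try {
         // Assumes layout implementations expose a no-arg constructor.
         return Class.forName(className.trim())
             .asSubclass(Layout.class)
             .getDeclaredConstructor()
             .newInstance();
       } catch (ReflectiveOperationException | ClassCastException e) {
         throw new IllegalArgumentException("Cannot load layout class: " + className, e);
       }
     }
   }
   ```
   
   Because the class name is read from the MDT's own properties, a reader and a writer that open the same MDT would resolve the same layout without consulting the data table's config.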
   
   #### On-disk shapes
   
   **Flat (default, today's behavior):**
   ```
   .hoodie/metadata/record_index/
   ├── .hoodie_partition_metadata
   ├── .record-index-0000-0_<instant>.log
   ├── .record-index-0001-0_<instant>.log
   └── ... (up to fileGroupCount entries)
   ```
   
   **Sub-directory bucketed (opt-in):**
   ```
   .hoodie/metadata/record_index/
   ├── .hoodie_partition_metadata           ← single marker at the LOGICAL ROOT
   ├── 0000/
   │   ├── .record-index-0000-0_<instant>.log
   │   └── ... (up to bucketSize file groups)
   ├── 0001/
   │   ├── .record-index-1000-0_<instant>.log
   │   └── ...
   ```
   
   For the bucketed layout, the bucket number is `fileGroupIndex / bucketSize` 
(integer division), rendered as a 4-digit directory name such as `0001`. The 
default `bucketSize` is 1000.
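   
   The bucket rule above can be illustrated with a small sketch. `BucketPathSketch` and its methods are hypothetical names, not the PR's code, and the zero-padded `%04d` formatting is inferred from the `0000/` and `0001/` directories in the tree above.
   
   ```java
   import java.util.Locale;

   final class BucketPathSketch {
     /** Maps a file group index to its bucket directory name using integer division. */
     static String bucketDirName(int fileGroupIndex, int bucketSize) {
       if (fileGroupIndex < 0 || bucketSize <= 0) {
         throw new IllegalArgumentException("fileGroupIndex must be >= 0 and bucketSize > 0");
       }
       // Locale.ROOT keeps the digits ASCII regardless of the JVM's default locale.
       return String.format(Locale.ROOT, "%04d", fileGroupIndex / bucketSize);
     }

     /** Joins the logical MDT partition with the bucket directory. */
     static String relativePath(String logicalPartition, int fileGroupIndex, int bucketSize) {
       return logicalPartition + "/" + bucketDirName(fileGroupIndex, bucketSize);
     }
   }
   ```
   
   With the default bucket size of 1000, file group 999 maps to `record_index/0000` and file group 1000 maps to `record_index/0001`, matching the tree above.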
   
   Crucially, `.hoodie_partition_metadata` lives only at the **logical 
partition root**, never inside a bucket sub-directory. This preserves 
`MDT-as-Hudi-table` semantics so 
`FSUtils.getAllPartitionPaths(.hoodie/metadata)` returns `["files", 
"record_index", ...]` regardless of bucketing.
   
   #### Persisted state (MDT-only)
   
   Three new table properties, all on the MDT's `hoodie.properties`, all 
written only when a non-flat layout is in use:
   
   | Property | Purpose |
   |---|---|
   | `hoodie.metadata.layout.class` | FQCN of the layout impl (immutable, set 
once at MDT init) |
   | `hoodie.metadata.layout.bucket.size` | Bucket size for 
`SubDirBucketedMDTLayout` |
   | `hoodie.metadata.layout.partition.file.group.counts` | Comma-separated 
`partition=count` map; sourced from the same values the writer computes at MDT 
init |
   
   The third property lets readers compute physical sub-paths without 
performing any filesystem listing.
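   
   As a rough illustration of the third property, a comma-separated `partition=count` value could be parsed and written as below. `FileGroupCountsSketch` is a hypothetical helper; the exact encoding used by the PR (escaping, ordering, whitespace handling) is not specified here, so this is a sketch under those assumptions.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   final class FileGroupCountsSketch {
     /** Parses "files=1,record_index=15000" into an insertion-ordered map. */
     static Map<String, Integer> parse(String value) {
       Map<String, Integer> counts = new LinkedHashMap<>();
       if (value == null || value.trim().isEmpty()) {
         return counts;
       }
       for (String entry : value.split(",")) {
         String[] kv = entry.trim().split("=", 2);
         if (kv.length != 2 || kv[0].trim().isEmpty()) {
           throw new IllegalArgumentException("Malformed entry: " + entry);
         }
         counts.put(kv[0].trim(), Integer.parseInt(kv[1].trim()));
       }
       return counts;
     }

     /** Serializes the map back to the comma-separated form. */
     static String serialize(Map<String, Integer> counts) {
       StringBuilder sb = new StringBuilder();
       for (Map.Entry<String, Integer> e : counts.entrySet()) {
         if (sb.length() > 0) {
           sb.append(',');
         }
         sb.append(e.getKey()).append('=').append(e.getValue());
       }
       return sb.toString();
     }
   }
   ```
   
   A reader holding this map knows the file group count of each MDT partition up front, which is what makes the physical sub-paths computable without a directory listing.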
   
   #### Plug points
   
   **Writer (`hudi-client-common`):**
   - `HoodieBackedTableMetadataWriter.initializeFileGroups`: uses 
`layout.getFileGroupRelativePath(ctx)` and `layout.getFileId(ctx)` per file 
group; writes `.hoodie_partition_metadata` at 
`layout.getPartitionMarkerPaths(...)`; persists layout state.
   - `HoodieBackedTableMetadataWriter.resolveLayoutForMDTInit`: fails fast with 
a clear error if the user requests a non-flat layout while RLI is configured in 
the partitioned mode (see Scope below).
   - `HoodieBackedTableMetadataWriter.getRecordTagger`: realigns the record's 
`partitionPath` with the file slice's physical bucket path under non-flat 
layouts, so downstream `HoodieAppendHandle` partition-path consistency checks 
pass.
   - `HoodieAppendHandle.doInit`: skips marker creation when writing into a 
layout sub-path on an MDT, preventing per-bucket markers.
   
   **Reader (`hudi-common`):**
   - `HoodieTableMetadataUtil.getPartitionFileSlices` and 
`getPartitionLatestFileSlicesIncludingInflight`: fan out across 
`layout.getPhysicalPartitions(partition, fgCount)` and aggregate. `fgCount` 
comes from the persisted layout state — no FS listing.
   - `FileSystemBackedTableMetadata.getAllFilesInPartition` and 
`getAllFilesInPartitions`: layout-aware for MDT base paths so direct Spark 
queries on the MDT pick up file groups under bucket sub-directories.
   - `BaseHoodieTableFileIndex.filterFiles`: routes MDT partitions through 
`HoodieTableMetadataUtil` so the Spark file index sees correctly-grouped file 
slices under bucketing.
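   
   The reader-side fan-out can be pictured with the sketch below, which enumerates the bucket sub-directories for a logical partition from the persisted file group count alone. `PhysicalPartitionsSketch` is hypothetical and assumes buckets are numbered contiguously from `0000`; the PR's `getPhysicalPartitions` may differ in detail.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Locale;

   final class PhysicalPartitionsSketch {
     /** Lists the bucket sub-paths needed to cover fileGroupCount file groups. */
     static List<String> physicalPartitions(String logicalPartition, int fileGroupCount, int bucketSize) {
       if (fileGroupCount < 0 || bucketSize <= 0) {
         throw new IllegalArgumentException("fileGroupCount must be >= 0 and bucketSize > 0");
       }
       // Ceiling division: a partially filled last bucket still needs its own directory.
       int bucketCount = (fileGroupCount + bucketSize - 1) / bucketSize;
       List<String> paths = new ArrayList<>(bucketCount);
       for (int bucket = 0; bucket < bucketCount; bucket++) {
         paths.add(logicalPartition + "/" + String.format(Locale.ROOT, "%04d", bucket));
       }
       return paths;
     }
   }
   ```
   
   For 15,000 file groups and a bucket size of 1000 this yields 15 paths, `record_index/0000` through `record_index/0014`, which a reader can then visit and aggregate.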
   
   ---
   
   ### Scope (patch 1)
   
   - ✅ Non-partitioned MDT partitions: `files`, `column_stats`, 
`bloom_filters`, `expression_index`, `secondary_index`.
   - ✅ **Global RLI** (non-partitioned RLI).
   - ❌ **Partitioned RLI**: explicitly rejected at MDT init with a clear error. 
Partitioned-RLI growth (file groups appear over time as new data partitions 
land) needs a distinct bucketing strategy that will be addressed in a follow-up 
patch / RFC. The check is at writer-init time so misconfiguration fails fast 
rather than producing a half-broken table.
   
   ---
   
   ### Backward compatibility
   
   - Tables that don't set `hoodie.metadata.layout.class` continue to use 
`FlatMDTLayout`, with **bit-identical on-disk and properties layout** as 
before. No new properties are written.
   - The layout is fixed at MDT initialization and cannot be retrofitted onto 
an existing MDT (consistent with how MDT partitions are bootstrapped today).
   - **Forward-compat for pre-1.3.0 readers**: a pre-1.3.0 reader opening an 
MDT with sub-directory bucketing enabled does not understand the new layout. 
Documentation should call this out; a follow-up patch will add a table-version 
gate so pre-1.3.0 readers fail fast with a clear error rather than silently 
returning empty results. Operators who opt into the bucketing layout in 1.3.0 
should ensure all readers are on ≥ 1.3.0 first.
   
   ---
   
   ### Configuration
   
   Writer-side configs (default values yield today's behavior):
   
   | Config | Default | Description |
   |---|---|---|
   | `hoodie.metadata.layout.class` | (unset → `FlatMDTLayout`) | FQCN of the 
layout class. Set to `org.apache.hudi.metadata.SubDirBucketedMDTLayout` to 
enable sub-directory bucketing. |
   | `hoodie.metadata.layout.bucket.size` | `1000` | Maximum file groups per 
bucket sub-directory under `SubDirBucketedMDTLayout`. Ignored otherwise. |
   
   These configs apply only at MDT initialization. An MDT already on disk keeps 
its existing layout regardless of writer config changes.
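   
   As an illustration, the two writer-side configs could be assembled as plain properties like this. `BucketedLayoutConfigSketch` is a hypothetical helper; how the properties are handed to a Hudi writer is deliberately left out, and only the keys and the class name come from the table above.
   
   ```java
   import java.util.Properties;

   final class BucketedLayoutConfigSketch {
     /** Builds the opt-in settings for sub-directory bucketing with the given bucket size. */
     static Properties bucketedLayoutProps(int bucketSize) {
       if (bucketSize <= 0) {
         throw new IllegalArgumentException("bucketSize must be > 0");
       }
       Properties props = new Properties();
       props.setProperty("hoodie.metadata.layout.class",
           "org.apache.hudi.metadata.SubDirBucketedMDTLayout");
       props.setProperty("hoodie.metadata.layout.bucket.size", Integer.toString(bucketSize));
       return props;
     }
   }
   ```
   
   Since the configs apply only at MDT initialization, setting them on a writer for a table whose MDT already exists would have no effect on that MDT's layout.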
   
   ### Impact
   
   - **Storage layout**: no change for tables that don't opt in. New optional 
layout for tables that do.
   - **API**: no public API breakage. SPI is new and additive.
   - **Performance**: no read-side regression (no extra FS listing); 
writer-side adds one properties write per partition init under non-flat layouts.
   - **Default behavior**: unchanged.
   
   ### Risk level
   
   `medium`
   
   Touches several core MDT writer/reader paths but with careful conditional 
guards so the flat default keeps identical behavior. Comprehensive unit + 
functional tests; new functional test validates the MDT-as-Hudi-table contract 
under bucketing.
   
   ### Documentation Update
   
   - New configs documented inline (`@ConfigProperty` annotations on 
`HoodieTableConfig` and `HoodieMetadataConfig`).
   - A standalone RFC for the full layout SPI, including the partitioned-RLI 
design, will be raised separately.
   
   ---
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [ ] CI passed
   

