nsivabalan commented on issue #18178:
URL: https://github.com/apache/hudi/issues/18178#issuecomment-4770730271

   Sharing an alternative design for this requirement. A community user 
reported the same scale problem (an RLI partition hitting per-directory file 
count limits on HDFS), and we arrived at a slightly different shape than the 
proposal above. Code: https://github.com/nsivabalan/hudi/tree/mdt_layout_spi 
(PR incoming).
   
   ## Proposal: pluggable MDT layout SPI
   
   Instead of hard-wiring sub-directory bucketing into the MDT writer/reader, 
introduce a small SPI for the MDT's on-disk file-group layout. Two 
implementations ship in OSS:
   
   - **`FlatMDTLayout`** (default) — today's behavior, bit-for-bit. Every file 
group lives directly under its MDT partition directory. Existing tables with no 
opt-in get identical on-disk and properties layout as before.
   - **`SubDirBucketedMDTLayout`** (opt-in) — distributes file groups into 
4-digit bucket sub-directories. Bucket = `fileGroupIndex / bucketSize` 
(integer division). On-disk shape:
   
     ```
     .hoodie/metadata/record_index/
     ├── .hoodie_partition_metadata           ← single marker at the LOGICAL ROOT
     ├── 0000/
     │   ├── .record-index-0000-0_<instant>.log
     │   └── ... (up to bucketSize file groups)
     ├── 0001/
     │   └── ...
     ```
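
To make the bucket arithmetic concrete, here is a minimal Java sketch of how a file-group index could map to its bucket sub-directory. The class and method names are hypothetical and are not taken from the branch; the only parts drawn from the description above are the `fileGroupIndex / bucketSize` formula and the 4-digit directory name.

```java
import java.util.Locale;

// Illustrative only: names here are assumptions, not the actual Hudi classes.
final class BucketPathSketch {

    // Bucket = fileGroupIndex / bucketSize, using integer division.
    static int bucketOf(int fileGroupIndex, int bucketSize) {
        if (fileGroupIndex < 0 || bucketSize <= 0) {
            throw new IllegalArgumentException("fileGroupIndex must be >= 0 and bucketSize > 0");
        }
        return fileGroupIndex / bucketSize;
    }

    // Zero-padded 4-digit directory name, e.g. 1 -> "0001".
    static String bucketDirName(int bucket) {
        return String.format(Locale.ROOT, "%04d", bucket);
    }

    // Relative path of the bucket directory under the logical MDT partition.
    static String physicalPath(String logicalPartition, int fileGroupIndex, int bucketSize) {
        return logicalPartition + "/" + bucketDirName(bucketOf(fileGroupIndex, bucketSize));
    }
}
```

With a bucket size of 1000, file groups 0 through 999 would land in `record_index/0000` and file group 1000 would start `record_index/0001`.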
   
   Third parties (including any deployment with custom growth or scaling needs) 
can ship their own `HoodieMetadataTableLayout` implementation and wire it in 
via `hoodie.metadata.layout.class` without forking Hudi.
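
As a rough illustration of the wiring, the sketch below resolves a layout implementation from a class-name property using plain Java reflection. Only the property key `hoodie.metadata.layout.class` comes from the proposal; the `Layout` interface, the two stand-in classes, and the loader are hypothetical placeholders, and the real code may well instantiate the class differently.

```java
import java.util.Properties;

// Illustrative only: a stand-in for the real SPI, not the actual Hudi types.
final class LayoutLoaderSketch {

    // Hypothetical minimal SPI surface, just enough to show the loading step.
    interface Layout {
        String name();
    }

    static final class FlatLayout implements Layout {
        public String name() { return "flat"; }
    }

    static final class BucketedLayout implements Layout {
        public String name() { return "bucketed"; }
    }

    static final String LAYOUT_CLASS_KEY = "hoodie.metadata.layout.class";

    // Falls back to the flat layout when the property is absent (the stated default).
    static Layout load(Properties props) {
        String className = props.getProperty(LAYOUT_CLASS_KEY, FlatLayout.class.getName());
        try {
            Object instance = Class.forName(className).getDeclaredConstructor().newInstance();
            return Layout.class.cast(instance); // ClassCastException if it is not a Layout
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot load layout class " + className, e);
        }
    }
}
```

A third-party layout would then only need to be on the classpath and named in the property.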
   
   ## Why the SPI shape
   
   Three concerns drove the design:
   
   ### 1. `.hoodie_partition_metadata` placement and the MDT-as-Hudi-table 
contract
   
   The MDT is itself a Hudi table. Tooling and direct queries on 
`.hoodie/metadata/` rely on `FSUtils.getAllPartitionPaths` → 
`HoodieTableMetadata.getAllPartitionPaths` → `FileSystemBackedTableMetadata`'s 
partition recursion, which uses `.hoodie_partition_metadata` as the "this is a 
Hudi partition" marker (`hudi-common/.../FileSystemBackedTableMetadata.java`, 
around the recursion that calls `HoodiePartitionMetadata.hasPartitionMetadata`).
   
   If the marker is placed inside each bucket directory, 
`FSUtils.getAllPartitionPaths(.hoodie/metadata)` returns `[files/0000, 
record_index/0000, record_index/0001, ...]` instead of `[files, record_index, 
...]`. Three observable consequences:
   
   - A direct Spark query on the MDT 
(`spark.read.format("hudi").load(.hoodie/metadata)` — see 
`TestSecondaryIndexPruning`, `HoodieMetadataTableValidator`) sees the bucket 
paths as logical partitions. When Spark asks for file slices of 
`record_index/0000`, the fan-out tries to enumerate sub-buckets *under* it, 
finds none, and returns empty rows.
   - `hudi-cli metadata partitions` reports one entry per bucket (roughly N×M 
for N logical partitions with M buckets each) instead of N.
   - Validation tooling reports per-bucket stats instead of per-partition.
   
   The SPI's `getPartitionMarkerPaths` lets each layout decide where markers 
go. Both shipped implementations place a single marker at the logical partition 
root only. Marker creation in `HoodieAppendHandle.doInit` is guarded so it 
never creates a marker inside a layout sub-path on the MDT.
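
The effect of marker placement on partition discovery can be shown with a toy model. The sketch below is not the `FileSystemBackedTableMetadata` code; it assumes a simplified recursion over an in-memory directory tree in which any directory carrying a marker is reported as a partition and any other directory is descended into.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of marker-driven partition discovery; all names are hypothetical.
final class MarkerDiscoverySketch {

    // children: directory path -> names of its sub-directories ("" is the MDT base path).
    // marked:   directory paths that contain a .hoodie_partition_metadata marker.
    static List<String> discover(String dir,
                                 Map<String, List<String>> children,
                                 Set<String> marked) {
        List<String> found = new ArrayList<>();
        List<String> subDirs = children.getOrDefault(dir, Collections.<String>emptyList());
        for (String child : subDirs) {
            String path = dir.isEmpty() ? child : dir + "/" + child;
            if (marked.contains(path)) {
                found.add(path);                                // marker present: this is a partition
            } else {
                found.addAll(discover(path, children, marked)); // no marker: keep descending
            }
        }
        return found;
    }
}
```

With markers at `files` and `record_index`, this model reports the two logical partitions. With markers inside `files/0000`, `record_index/0000`, and `record_index/0001` instead, it reports the three bucket paths, which is the behavior described above.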
   
   ### 2. Read-side fan-out without per-call FS listing
   
   Listing sub-directories of an MDT partition on every read defeats one of the 
MDT's design goals (avoiding FS listings on the read path). The SPI's 
`getPhysicalPartitions(logicalPartition, fileGroupCount)` is pure math — it 
returns `[partition]` for the flat layout and `["partition/0000", 
"partition/0001", ...]` for sub-dir bucketing, with `fileGroupCount` sourced 
from a new MDT property `hoodie.metadata.layout.partition.file.group.counts` 
written once at init. No FS listing.
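
A minimal sketch of what "pure math" could look like for the two shipped layouts follows. The method name mirrors `getPhysicalPartitions` from the description, but the class, the explicit `bucketSize` parameter, and the ceiling-division bucket count are assumptions made for illustration, not the code on the branch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Illustrative only: no file-system calls, just arithmetic on the file-group count.
final class PhysicalPartitionsSketch {

    // Flat layout: the logical partition is the only physical partition.
    static List<String> flat(String logicalPartition, int fileGroupCount) {
        return Collections.singletonList(logicalPartition);
    }

    // Bucketed layout: one sub-directory per bucket, assuming the last bucket may be partly full.
    static List<String> subDirBucketed(String logicalPartition, int fileGroupCount, int bucketSize) {
        if (fileGroupCount < 0 || bucketSize <= 0) {
            throw new IllegalArgumentException("fileGroupCount must be >= 0 and bucketSize > 0");
        }
        // Ceiling division, done in long arithmetic to avoid int overflow.
        int bucketCount = (int) (((long) fileGroupCount + bucketSize - 1) / bucketSize);
        List<String> partitions = new ArrayList<>(bucketCount);
        for (int bucket = 0; bucket < bucketCount; bucket++) {
            partitions.add(logicalPartition + "/" + String.format(Locale.ROOT, "%04d", bucket));
        }
        return partitions;
    }
}
```

Under these assumptions, 2500 file groups with a bucket size of 1000 yield `record_index/0000`, `record_index/0001`, and `record_index/0002`, all computed from the persisted count.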
   
   ### 3. Single source of truth for layout state
   
   Layout class + bucket size + per-partition file-group counts are persisted 
only on the MDT's own `hoodie.properties` (not on the data table's). The 
data-table side has no knowledge of the MDT's internal layout. Writers and 
readers route through `metadataMetaClient.getTableConfig()`.
   
   ## Patch 1 scope
   
   This first PR focuses on the **global RLI** mode (and all other 
non-partitioned MDT partitions: files, column_stats, bloom_filters, 
expression_index, secondary_index). Partitioned RLI is **rejected up front** 
with a clear error if the user enables the layout on a table whose RLI is in 
the partitioned mode — its growth model (file groups appear as new data 
partitions land) needs a distinct strategy.
   
   ## Follow-ups (separate work)
   
   1. **Partitioned-RLI layout** — a layout where the sub-directory is the 
data-partition name (`record_index/<dataPartition>/`). Likely needs an additive 
lifecycle hook (`onFileGroupsAdded`) for growth. This is being captured in a 
forthcoming RFC.
   2. **Table-version gating** — introduce TV 10 and gate sub-directory 
bucketing on it so pre-1.3.0 readers fail fast with a clear error rather than 
silently misreading the table.
   3. **Backward-compat shim** — for any in-the-wild deployments that already 
have a different bucketing layout, an optional compat shim that recognizes the 
legacy property name.
   
   ## Compared with #18182
   
   This proposal is an alternative shape that:
   
   - Fixes the marker-placement issue (root vs. bucket) so MDT-as-Hudi-table 
semantics hold under bucketing.
   - Removes per-call FS listing on reads by sourcing `fileGroupCount` from MDT 
properties.
   - Keeps the layout property on the MDT only (single source of truth).
   - Exposes the layout as a pluggable SPI so third parties can ship their own 
without forking.
   
   Happy to discuss tradeoffs or merge with the existing work — would 
appreciate community input before we open the PR. Will follow up with a full 
RFC for comprehensive partitioned-RLI support.
   

