nsivabalan commented on issue #18178: URL: https://github.com/apache/hudi/issues/18178#issuecomment-4770730271

Sharing an alternative design for this requirement. We received the same scale problem from a community user (the RLI partition hitting per-directory file count limits on HDFS) and arrived at a slightly different shape than the proposal above.

Code: https://github.com/nsivabalan/hudi/tree/mdt_layout_spi (PR incoming).

## Proposal: pluggable MDT layout SPI

Instead of hard-wiring sub-directory bucketing into the MDT writer/reader, introduce a small SPI for the MDT's on-disk file-group layout. Two implementations ship in OSS:

- **`FlatMDTLayout`** (default): today's behavior, bit-for-bit. Every file group lives directly under its MDT partition directory. Existing tables with no opt-in get identical on-disk and properties layout as before.
- **`SubDirBucketedMDTLayout`** (opt-in): distributes file groups into 4-digit bucket sub-directories. Bucket = `fileGroupIndex / bucketSize`.

On-disk shape:

```
.hoodie/metadata/record_index/
├── .hoodie_partition_metadata   ← single marker at the LOGICAL ROOT
├── 0000/
│   ├── .record-index-0000-0_<instant>.log
│   └── ... (up to bucketSize file groups)
├── 0001/
│   └── ...
```

Third parties (including any deployment with custom growth or scaling needs) can ship their own `HoodieMetadataTableLayout` implementation and wire it in via `hoodie.metadata.layout.class` without forking Hudi.

## Why the SPI shape

Three concerns drove the design:

### 1. `.hoodie_partition_metadata` placement and the MDT-as-Hudi-table contract

The MDT is itself a Hudi table. Tooling and direct queries on `.hoodie/metadata/` rely on `FSUtils.getAllPartitionPaths` → `HoodieTableMetadata.getAllPartitionPaths` → `FileSystemBackedTableMetadata`'s partition recursion, which uses `.hoodie_partition_metadata` as the "this is a Hudi partition" marker (`hudi-common/.../FileSystemBackedTableMetadata.java`, around the recursion that calls `HoodiePartitionMetadata.hasPartitionMetadata`).
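
To make the marker contract concrete, here is a minimal Python model of marker-driven partition discovery. It is a simplified sketch, not Hudi's `FileSystemBackedTableMetadata` code: the nested-dict directory tree, the `discover_partitions` helper, and the file names are all hypothetical, and it assumes the recursion treats a directory as a partition as soon as it finds the marker and does not descend further.

```python
MARKER = ".hoodie_partition_metadata"


def discover_partitions(tree, prefix=""):
    """Return relative paths of directories that contain the marker file.

    `tree` is a hypothetical in-memory directory model: a dict maps names to
    children, where a dict child is a sub-directory and None is a file.
    """
    partitions = []
    for name in sorted(tree):
        child = tree[name]
        if not isinstance(child, dict):
            continue  # plain file, nothing to recurse into
        path = f"{prefix}/{name}" if prefix else name
        if MARKER in child:
            # Assumption: a marked directory is a partition; stop descending.
            partitions.append(path)
        else:
            partitions.extend(discover_partitions(child, path))
    return partitions


# Marker only at the logical partition root.
marker_at_root = {
    "record_index": {
        MARKER: None,
        "0000": {"file-group-a.log": None},
        "0001": {"file-group-b.log": None},
    }
}

# Marker inside every bucket directory instead.
marker_in_buckets = {
    "record_index": {
        "0000": {MARKER: None, "file-group-a.log": None},
        "0001": {MARKER: None, "file-group-b.log": None},
    }
}

print(discover_partitions(marker_at_root))
print(discover_partitions(marker_in_buckets))
```

In this model the root-marker tree yields a single partition, `record_index`, while the per-bucket tree yields `record_index/0000` and `record_index/0001` as if they were separate partitions.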

If the marker is placed inside each bucket directory, `FSUtils.getAllPartitionPaths(.hoodie/metadata)` returns `[files/0000, record_index/0000, record_index/0001, ...]` instead of `[files, record_index, ...]`. Three observable consequences:

- A direct Spark query on the MDT (`spark.read.format("hudi").load(.hoodie/metadata)`; see `TestSecondaryIndexPruning`, `HoodieMetadataTableValidator`) sees the bucket paths as logical partitions. When Spark asks for file slices of `record_index/0000`, the fan-out tries to enumerate sub-buckets *under* it, finds none, and returns empty rows.
- `hudi-cli metadata partitions` reports N×M partitions instead of N.
- Validation tooling reports per-bucket stats instead of per-partition.

The SPI's `getPartitionMarkerPaths` lets each layout decide where markers go. Both shipped implementations place a single marker at the logical partition root only. Marker creation in `HoodieAppendHandle.doInit` is guarded so it never creates a marker inside a layout sub-path on the MDT.

### 2. Read-side fan-out without per-call FS listing

Listing sub-directories of an MDT partition on every read defeats one of the MDT's design goals (avoiding FS listings on the read path). The SPI's `getPhysicalPartitions(logicalPartition, fileGroupCount)` is pure math: it returns `[partition]` for the flat layout and `["partition/0000", "partition/0001", ...]` for sub-dir bucketing, with `fileGroupCount` sourced from a new MDT property `hoodie.metadata.layout.partition.file.group.counts` written once at init. No FS listing.

### 3. Single source of truth for layout state

Layout class + bucket size + per-partition file-group counts are persisted only on the MDT's own `hoodie.properties` (not on the data table's). The data-table side has no knowledge of the MDT's internal layout. Writers and readers route through `metadataMetaClient.getTableConfig()`.
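
The pure-math fan-out from concern 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SPI itself: `bucket_for` and `physical_partitions` are hypothetical stand-ins for the layout's methods, the bucket size of 1000 in the usage lines is an arbitrary example value, and the number of buckets is assumed to be `ceil(fileGroupCount / bucketSize)`.

```python
def bucket_for(file_group_index, bucket_size):
    """4-digit bucket directory name: fileGroupIndex / bucketSize (integer division)."""
    return f"{file_group_index // bucket_size:04d}"


def physical_partitions(logical_partition, file_group_count, bucket_size=None):
    """Derive physical partition paths without listing the file system.

    bucket_size=None models the flat layout; an integer models sub-dir bucketing.
    """
    if bucket_size is None:
        return [logical_partition]
    # Assumption: buckets are filled in index order, so ceil division gives the count.
    num_buckets = (file_group_count + bucket_size - 1) // bucket_size
    return [f"{logical_partition}/{b:04d}" for b in range(num_buckets)]


flat = physical_partitions("record_index", file_group_count=2500)
bucketed = physical_partitions("record_index", file_group_count=2500, bucket_size=1000)
print(flat)
print(bucketed)
print(bucket_for(1999, 1000))
```

With 2500 file groups and a bucket size of 1000, the bucketed call yields three paths (`record_index/0000` through `record_index/0002`), and file group 1999 maps to bucket `0001`. Both inputs would come from the MDT's persisted properties, so nothing in this path touches the file system.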

## Patch 1 scope

This first PR focuses on the **global RLI** mode (and all other non-partitioned MDT partitions: files, column_stats, bloom_filters, expression_index, secondary_index). Partitioned RLI is **rejected up front** with a clear error if the user enables the layout on a table whose RLI is in the partitioned mode; its growth model (file groups appear as new data partitions land) needs a distinct strategy.

## Follow-ups (separate work)

1. **Partitioned-RLI layout**: a layout where the sub-directory is the data-partition name (`record_index/<dataPartition>/`). Likely needs an additive lifecycle hook (`onFileGroupsAdded`) for growth. This is being captured in a forthcoming RFC.
2. **Table-version gating**: introduce TV 10 and gate sub-directory bucketing on it so pre-1.3.0 readers fail fast with a clear error rather than silently mis-read.
3. **Backward-compat shim**: for any in-the-wild deployments that already have a different bucketing layout, an optional compat shim that recognizes the legacy property name.

## Compared with #18182

This proposal is an alternative shape that:

- Fixes the marker-placement issue (root vs. bucket) so MDT-as-Hudi-table semantics hold under bucketing.
- Removes per-call FS listing on reads by sourcing `fileGroupCount` from MDT properties.
- Keeps the layout property on the MDT only (single source of truth).
- Exposes the layout as a pluggable SPI so third parties can ship their own without forking.

Happy to discuss tradeoffs or merge with the existing work; would appreciate community input before we open the PR. Will follow up with a full RFC for comprehensive partitioned-RLI support.
