[
https://issues.apache.org/jira/browse/HUDI-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-9685:
---------------------------------
Labels: pull-request-available (was: )
> Enable row group-level file stitching in Hudi clustering using schema
> grouping and Parquet APIs
> -----------------------------------------------------------------------------------------------
>
> Key: HUDI-9685
> URL: https://issues.apache.org/jira/browse/HUDI-9685
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Xinli Shang
> Priority: Major
> Labels: pull-request-available
>
> This ticket builds on the groundwork laid in
> https://github.com/apache/hudi/pull/13365, which added support for merging
> small files during clustering. That initial implementation handled schema
> evolution and masking via low-level Parquet logic, introducing complexity in
> managing column alignment and schema compatibility.
> This follow-up introduces a more scalable and simplified approach:
> * Group files by schema compatibility before merging, eliminating the need to
> reconcile schema differences during the merge.
> * Leverage Parquet’s native file merging capabilities, specifically code
> paths like appendFile() and mergeMetadataFiles() from the original
> parquet-tools, which have since been deprecated but were battle-tested
> in production use.
> * Introduce a new implementation: HoodieParquetStrictMerge, which performs
> efficient row group merging while assuming all files in the group have an
> identical schema.
> * Add a fast binary-level file copier (LiteFileBinaryCopier) to support
> performant file operations.
> * Update the clustering plan strategy (PartitionAwareClusteringPlanStrategy)
> to incorporate schema grouping and row group-level optimization.
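
To illustrate the grouping step described above, here is a minimal sketch, not
the code from the pull request. It assumes each file's schema is available as a
string and that two files are compatible only when those strings are equal. The
class and method names (SchemaGroupingSketch, groupBySchema) are hypothetical
and do not come from the Hudi codebase.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the Hudi codebase.
class SchemaGroupingSketch {

    // Groups file paths by schema string, preserving the input order.
    // Assumption: files are mergeable only if their schema strings are equal.
    static Map<String, List<String>> groupBySchema(Map<String, String> schemaByFile) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : schemaByFile.entrySet()) {
            groups.computeIfAbsent(entry.getValue(), k -> new ArrayList<>())
                  .add(entry.getKey());
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, String> schemaByFile = new LinkedHashMap<>();
        schemaByFile.put("f1.parquet", "message r { required int64 id; }");
        schemaByFile.put("f2.parquet", "message r { required int64 id; optional binary name; }");
        schemaByFile.put("f3.parquet", "message r { required int64 id; }");

        // Each group could then be merged row group by row group,
        // with no schema reconciliation inside the group.
        for (Map.Entry<String, List<String>> group : groupBySchema(schemaByFile).entrySet()) {
            System.out.println(group.getValue() + " share schema: " + group.getKey());
        }
    }
}
```

Because every file in a group shares one schema, the merge step for that group
can copy row groups as they are, without column alignment or masking.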
> This feature is:
> * Controlled via a new config: hoodie.storage.parquet.lite.file.merger.enable
> (default: false)
> * Backward-compatible and off by default
> * Complementary to the previous implementation: it simplifies the code path
> when schema compatibility is ensured
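
As a rough sketch of the opt-in behavior, the flag can be modeled with plain
java.util.Properties. Only the key name and its default of false come from the
ticket. The helper class and its methods are hypothetical, and how Hudi itself
reads the key is not shown here.

```java
import java.util.Properties;

// Hypothetical helper: only the config key and its default come from the ticket.
class LiteMergerConfigSketch {

    static final String LITE_MERGER_KEY = "hoodie.storage.parquet.lite.file.merger.enable";

    // Returns a copy of the given properties with the lite file merger switched on.
    static Properties withLiteMergerEnabled(Properties base) {
        Properties copy = new Properties();
        copy.putAll(base);
        copy.setProperty(LITE_MERGER_KEY, "true");
        return copy;
    }

    // Reads the flag, falling back to the documented default of false.
    static boolean isLiteMergerEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(LITE_MERGER_KEY, "false"));
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        Properties enabled = withLiteMergerEnabled(defaults);
        System.out.println("default: " + isLiteMergerEnabled(defaults));
        System.out.println("opted in: " + isLiteMergerEnabled(enabled));
    }
}
```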
> Benefits:
> * Reduces the complexity of handling schema evolution in clustering
> * Avoids the overhead of column masking and re-writing data unnecessarily
> * Produces better-optimized output: fewer small files and improved read
> performance
> Verification:
> * Unit and integration tests added for new components