[ 
https://issues.apache.org/jira/browse/HUDI-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-9685:
---------------------------------
    Labels: pull-request-available  (was: )

> Enable row group-level file stitching in Hudi clustering using schema 
> grouping and Parquet APIs
> -----------------------------------------------------------------------------------------------
>
>                 Key: HUDI-9685
>                 URL: https://issues.apache.org/jira/browse/HUDI-9685
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Xinli Shang
>            Priority: Major
>              Labels: pull-request-available
>
> This ticket builds on the groundwork laid in 
> https://github.com/apache/hudi/pull/13365, which added support for merging 
> small files during clustering. That initial implementation handled schema 
> evolution and masking via low-level Parquet logic, introducing complexity in 
> managing column alignment and schema compatibility.
> This follow-up introduces a more scalable and simplified approach:
> * Group files by schema compatibility before merging, eliminating the need to 
> reconcile schema differences during the merge.
> * Leverage Parquet’s native file merging capabilities, specifically code 
> paths like appendFile() and mergeMetadataFiles() from the original 
> parquet-tools, which have since been deprecated but were battle-tested in 
> production.
> * Introduce a new implementation, HoodieParquetStrictMerge, which performs 
> efficient row group merging on the assumption that all files in a group 
> have an identical schema.
> * Add a fast binary-level file copier (LiteFileBinaryCopier) to support 
> performant file operations.
> * Update the clustering plan strategy (PartitionAwareClusteringPlanStrategy) 
> to incorporate schema grouping and row group-level optimization.
> This feature is:
> * Controlled by a new config, hoodie.storage.parquet.lite.file.merger.enable 
> (default: false)
> * Backward-compatible, since it is off by default
> * Complementary to the previous implementation: it simplifies the code path 
> when schema compatibility is ensured
> Benefits:
> * Reduces the complexity of handling schema evolution in clustering
> * Avoids the overhead of column masking and re-writing data unnecessarily
> * Produces better-optimized output with fewer small files and improved read 
> performance
> Verification:
> * Unit and integration tests added for new components
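
The schema-grouping step described above can be illustrated with a minimal
sketch. This is not code from the pull request: the class name
SchemaGroupingSketch, the SmallFile holder, and the use of a plain schema
string as the grouping key are all assumptions made for illustration. The
idea is only that files are bucketed by identical schema first, so each
bucket could then be merged without reconciling schema differences.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of grouping files by schema; not Hudi code. */
public class SchemaGroupingSketch {

    /** Hypothetical stand-in for a small file and its schema, held as a string. */
    static final class SmallFile {
        final String path;
        final String schema;

        SmallFile(String path, String schema) {
            this.path = path;
            this.schema = schema;
        }
    }

    /**
     * Buckets file paths by exact schema string. Insertion order is kept so
     * that groups appear in the order their schema was first seen.
     */
    static Map<String, List<String>> groupBySchema(List<SmallFile> files) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (SmallFile f : files) {
            groups.computeIfAbsent(f.schema, k -> new ArrayList<>()).add(f.path);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Assumed sample input: two files share a schema, one has an extra column.
        List<SmallFile> files = Arrays.asList(
            new SmallFile("f1.parquet", "id:long,name:string"),
            new SmallFile("f2.parquet", "id:long,name:string,age:int"),
            new SmallFile("f3.parquet", "id:long,name:string"));

        // Each group could be handed to a strict, same-schema merge step.
        for (Map.Entry<String, List<String>> e : groupBySchema(files).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

In this sketch f1.parquet and f3.parquet land in one group and f2.parquet in
another. Per the description, such a path would only be taken when
hoodie.storage.parquet.lite.file.merger.enable is set to true.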



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
