Xinli Shang created HUDI-9685:
---------------------------------

             Summary: Enable row group-level file stitching in Hudi clustering 
using schema grouping and Parquet APIs
                 Key: HUDI-9685
                 URL: https://issues.apache.org/jira/browse/HUDI-9685
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Xinli Shang


This ticket builds on the groundwork laid in 
https://github.com/apache/hudi/pull/13365, which added support for merging 
small files during clustering. That initial implementation handled schema 
evolution and masking via low-level Parquet logic, introducing complexity in 
managing column alignment and schema compatibility.

This follow-up introduces a simpler, more scalable approach:

* Group files by schema compatibility before merging, eliminating the need to 
reconcile schema differences during the merge.

* Leverage Parquet’s native file merging capabilities, specifically code paths 
like appendFile() and mergeMetadataFiles() from the original parquet-tools. 
These code paths have since been deprecated, but they were battle-tested and 
used in production.

* Introduce a new implementation, HoodieParquetStrictMerge, which merges row 
groups efficiently and assumes that all files in a group share an identical 
schema.

* Add a fast binary-level file copier (LiteFileBinaryCopier) to support 
performant file operations.

* Update the clustering plan strategy (PartitionAwareClusteringPlanStrategy) to 
incorporate schema grouping and row group-level optimization.
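
To illustrate the grouping step above, here is a minimal Java sketch. It is 
not the Hudi implementation: SchemaGroupingSketch, SmallFile, and 
groupBySchema are hypothetical names, and "schema compatibility" is assumed 
to mean exact equality of a textual schema, in line with the identical-schema 
assumption stated for HoodieParquetStrictMerge. It uses only the Java standard 
library and does not call any Parquet API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these names are illustrative and are not Hudi classes.
final class SchemaGroupingSketch {

  // A small file together with a textual form of its Parquet schema.
  static final class SmallFile {
    final String path;
    final String schema;

    SmallFile(String path, String schema) {
      this.path = path;
      this.schema = schema;
    }
  }

  // Buckets file paths so that every bucket holds files with an identical schema.
  // Assumption: "compatible" is treated as exact equality of the schema string.
  static Map<String, List<String>> groupBySchema(List<SmallFile> files) {
    Map<String, List<String>> groups = new LinkedHashMap<>();
    for (SmallFile file : files) {
      groups.computeIfAbsent(file.schema, key -> new ArrayList<>()).add(file.path);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<SmallFile> files = Arrays.asList(
        new SmallFile("f1.parquet", "message r { required int64 id; }"),
        new SmallFile("f2.parquet", "message r { required int64 id; optional binary name; }"),
        new SmallFile("f3.parquet", "message r { required int64 id; }"));

    // Each bucket could then be merged on its own, with no schema reconciliation.
    for (List<String> group : groupBySchema(files).values()) {
      System.out.println(group);
    }
  }
}
```

In this sketch f1.parquet and f3.parquet land in one group and f2.parquet in 
another, so a strict merge would only ever see files with the same schema.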

This feature is:

* Controlled via a new config: hoodie.storage.parquet.lite.file.merger.enable 
(default: false)

* Backward-compatible and off by default

* Complementary to the previous implementation: it simplifies the code path 
when schema compatibility is ensured
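
As a rough illustration of the opt-in behavior, the following Java sketch 
reads the flag from a java.util.Properties object and falls back to false 
when it is absent. Only the config key and its default come from this ticket; 
LiteMergerConfigSketch and isLiteMergerEnabled are hypothetical names, and the 
real code presumably goes through Hudi's own config classes instead.

```java
import java.util.Properties;

// Hypothetical sketch: this helper is illustrative and is not a Hudi class.
final class LiteMergerConfigSketch {

  // Config key named in this ticket.
  static final String LITE_MERGER_ENABLE =
      "hoodie.storage.parquet.lite.file.merger.enable";

  // Returns the flag value, treating a missing key as false (off by default).
  static boolean isLiteMergerEnabled(Properties props) {
    return Boolean.parseBoolean(props.getProperty(LITE_MERGER_ENABLE, "false"));
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(isLiteMergerEnabled(props)); // key absent, so the feature is off

    props.setProperty(LITE_MERGER_ENABLE, "true");
    System.out.println(isLiteMergerEnabled(props)); // explicitly enabled
  }
}
```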

Benefits:

* Reduces the complexity of handling schema evolution in clustering
* Avoids the overhead of column masking and of rewriting data unnecessarily
* Produces fewer small files and better-optimized output files, improving read 
performance

Verification:

* Unit and integration tests added for new components




