Xinli Shang created HUDI-9685:
---------------------------------
Summary: Enable row group-level file stitching in Hudi clustering
using schema grouping and Parquet APIs
Key: HUDI-9685
URL: https://issues.apache.org/jira/browse/HUDI-9685
Project: Apache Hudi
Issue Type: Improvement
Reporter: Xinli Shang
This ticket builds on the groundwork laid in
https://github.com/apache/hudi/pull/13365, which added support for merging
small files during clustering. That initial implementation handled schema
evolution and masking via low-level Parquet logic, introducing complexity in
managing column alignment and schema compatibility.
This follow-up introduces a more scalable and simplified approach:
* Group files by schema compatibility before merging, eliminating the need to
reconcile schema differences during the merge.
* Leverage Parquet’s native file merging capabilities, specifically code paths
like appendFile() and mergeMetadataFiles() from the original parquet-tools.
These code paths have since been deprecated, but they were battle-tested and
used in production.
* Introduce a new implementation: HoodieParquetStrictMerge, which performs
efficient row group merging while assuming all files in the group have an
identical schema.
* Add a fast binary-level file copier (LiteFileBinaryCopier) to support
performant file operations.
* Update the clustering plan strategy (PartitionAwareClusteringPlanStrategy) to
incorporate schema grouping and row group-level optimization.
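The schema-grouping step described above can be sketched roughly as follows.
This is only an illustration, not the code in the ticket: the class
SchemaGroupingSketch, the FileSchema holder, and groupBySchema() are
hypothetical names, and "schema compatibility" is approximated here as exact
equality of the schema rendered as a string, which matches the identical-schema
assumption stated for HoodieParquetStrictMerge.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hudi: buckets candidate small files by schema
// so that each bucket can later be merged without any schema reconciliation.
final class SchemaGroupingSketch {

    // Minimal stand-in for a small file plus its Parquet schema as a string.
    static final class FileSchema {
        final String path;
        final String schema;

        FileSchema(String path, String schema) {
            this.path = path;
            this.schema = schema;
        }
    }

    // Assumption: two files are mergeable only if their schema strings are equal.
    // Insertion order is preserved so the resulting plan is deterministic.
    static Map<String, List<String>> groupBySchema(List<FileSchema> files) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (FileSchema f : files) {
            groups.computeIfAbsent(f.schema, k -> new ArrayList<>()).add(f.path);
        }
        return groups;
    }
}
```

Each resulting group would then be handed to the row group-level merger on its
own, so no column alignment is needed inside a group.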
This feature is:
* Controlled via a new config: hoodie.storage.parquet.lite.file.merger.enable
(default: false)
* Backward-compatible and off by default
* Complementary to the previous implementation: it offers a simpler code path
for cases where schema compatibility is already ensured
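As a rough illustration of the opt-in behavior, the flag could be set and read
through plain java.util.Properties as below. Only the config key and its
default of false come from this ticket; the class LiteMergerConfigSketch and
its methods are hypothetical, and how Hudi itself parses the key is not shown
here.

```java
import java.util.Properties;

// Hypothetical helper, not part of Hudi: sets and reads the opt-in flag.
final class LiteMergerConfigSketch {

    // Config key as named in this ticket.
    static final String LITE_MERGER_ENABLE =
        "hoodie.storage.parquet.lite.file.merger.enable";

    // Returns a copy of the given properties with the lite file merger enabled.
    static Properties withLiteMergerEnabled(Properties base) {
        Properties props = new Properties();
        props.putAll(base);
        props.setProperty(LITE_MERGER_ENABLE, "true");
        return props;
    }

    // Falls back to "false" when the key is absent, mirroring the stated default.
    static boolean isLiteMergerEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(LITE_MERGER_ENABLE, "false"));
    }
}
```

With no explicit setting, isLiteMergerEnabled() returns false, so existing
clustering jobs keep their current behavior.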
Benefits:
* Reduces the complexity of handling schema evolution in clustering
* Avoids the overhead of column masking and re-writing data unnecessarily
* Produces fewer small files and better-optimized output files, improving read
performance
Verification:
* Unit and integration tests added for new components