alamb opened a new issue, #21408: URL: https://github.com/apache/datafusion/issues/21408
### Is your feature request related to a problem or challenge?

This issue is filed to track the changes proposed in https://github.com/apache/datafusion/pull/21110.

DataFusion currently does not expose the new Parquet Content-Defined Chunking (CDC) support added in `parquet-rs` by https://github.com/apache/arrow-rs/pull/9450.

Traditional Parquet writing splits data pages at fixed sizes, so inserting or deleting a row causes subsequent pages to shift and can force nearly all bytes to be re-uploaded in content-addressable storage systems. CDC instead determines page boundaries using a rolling hash over column values, so unchanged data can produce identical pages across writes. This can reduce storage and upload costs and improve deduplication behavior for rewritten datasets.

### Describe the solution you'd like

Expose the Parquet CDC writer options in DataFusion so users can enable the feature when writing Parquet files. This should cover the configuration surface introduced upstream in `parquet-rs`, including:

- enabling content-defined chunking with the default settings
- configuring explicit CDC parameters such as `min_chunk_size`, `max_chunk_size`, and `norm_level`

The implementation and rationale are largely derived from https://github.com/apache/arrow-rs/pull/9450, and this issue exists to track carrying those changes through in DataFusion via https://github.com/apache/datafusion/pull/21110.

### Describe alternatives you've considered

Continue using the existing fixed-size Parquet page splitting behavior and do not expose CDC-related writer options in DataFusion. That preserves current behavior, but it means users cannot take advantage of the improved page stability and deduplication characteristics now available in `parquet-rs`.
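For readers unfamiliar with the rolling-hash idea mentioned in the problem statement, here is a minimal toy sketch of content-defined chunking over raw bytes. It is not the `parquet-rs` implementation (which works on column values and also has a `norm_level` setting that this sketch does not model). The names `gear` and `chunk` and the `min_size`, `max_size`, and `mask` parameters are hypothetical and chosen only for illustration.

```rust
/// Toy per-byte "gear" value derived with a SplitMix64-style mixer.
/// Assumption: a real implementation would more likely use a precomputed table.
fn gear(byte: u8) -> u64 {
    let mut z = (byte as u64).wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Split `data` into chunks whose boundaries depend on the content itself.
/// `min_size`, `max_size`, and `mask` are hypothetical toy parameters.
fn chunk(data: &[u8], min_size: usize, max_size: usize, mask: u64) -> Vec<&[u8]> {
    assert!(min_size >= 1 && min_size <= max_size);
    let mut chunks = Vec::new();
    let mut start = 0;
    let mut hash: u64 = 0;
    for (i, &b) in data.iter().enumerate() {
        // Shifting left by one bit per byte means only the most recent
        // 64 bytes can influence the hash: it "rolls" over the input.
        hash = (hash << 1).wrapping_add(gear(b));
        let len = i + 1 - start;
        // Cut where the hash matches the mask (content-defined boundary),
        // but never before `min_size` and always by `max_size`.
        let at_boundary = len >= min_size && (hash & mask) == 0;
        if at_boundary || len >= max_size {
            chunks.push(&data[start..=i]);
            start = i + 1;
        }
    }
    // Whatever is left after the last cut becomes the final chunk.
    if start < data.len() {
        chunks.push(&data[start..]);
    }
    chunks
}
```

Because a cut depends on recent content rather than on an absolute offset, inserting a few bytes near the start of the input typically disturbs only the chunks around the edit, and later boundaries fall on the same content again. With fixed-size splitting, every later boundary would shift. The size limits mean this resynchronization is likely for varied input but not guaranteed.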
### Additional context

- Tracking PR in DataFusion: https://github.com/apache/datafusion/pull/21110
- Upstream implementation in arrow-rs: https://github.com/apache/arrow-rs/pull/9450
- Related C++ implementation referenced by the upstream PR: https://github.com/apache/arrow/pull/45360
- Background on the feature: https://huggingface.co/blog/parquet-cdc

Most of the technical content above is intentionally derived from https://github.com/apache/arrow-rs/pull/9450, with the additional context that this issue tracks the corresponding DataFusion work in https://github.com/apache/datafusion/pull/21110.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
