alamb opened a new issue, #21408:
URL: https://github.com/apache/datafusion/issues/21408

   ### Is your feature request related to a problem or challenge?
   
   This issue is filed to track the changes proposed in 
https://github.com/apache/datafusion/pull/21110.
   
   DataFusion does not yet expose the Parquet Content-Defined Chunking (CDC) 
support added to `parquet-rs` in 
https://github.com/apache/arrow-rs/pull/9450. Traditional Parquet writing 
splits data pages at fixed sizes, so inserting or deleting a row shifts the 
contents of every subsequent page. In content-addressable storage systems, 
this can force nearly all bytes to be re-uploaded.
   
   CDC instead determines page boundaries using a rolling hash over column 
values, so unchanged data can produce identical pages across writes. This can 
reduce storage and upload costs and improve deduplication behavior for 
rewritten datasets.
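   
   To make the page-boundary idea concrete, here is a minimal Rust sketch of 
content-defined chunking over a plain byte buffer. It is an illustration under 
stated assumptions, not the `parquet-rs` implementation: the gear-style rolling 
hash, the helper names (`gear_table`, `cdc_chunks`, `fixed_chunks`, 
`shared_chunks`, `sample_data`), and the size and mask values suggested 
afterwards are all hypothetical, and the real writer hashes column values 
rather than raw bytes.

```rust
use std::collections::HashSet;

/// Hypothetical helper: a 256-entry table of pseudo-random values
/// (generated with splitmix64), one entry per possible byte value.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for slot in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *slot = z ^ (z >> 31);
    }
    table
}

/// Split `data` at content-defined boundaries.
/// A chunk ends when it has at least `min_size` bytes and the rolling hash
/// has no bits set under `mask`, or when it reaches `max_size` bytes.
/// Assumes `min_size <= max_size`.
fn cdc_chunks(data: &[u8], min_size: usize, max_size: usize, mask: u64) -> Vec<&[u8]> {
    let table = gear_table();
    let mut chunks = Vec::new();
    let mut start = 0;
    let mut hash: u64 = 0;
    for (i, &byte) in data.iter().enumerate() {
        // Gear hash: each shift ages older bytes out of the 64-bit state,
        // so the hash only reflects the most recent 64 bytes.
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        let len = i + 1 - start;
        if (len >= min_size && hash & mask == 0) || len >= max_size {
            chunks.push(&data[start..=i]);
            start = i + 1;
            hash = 0;
        }
    }
    if start < data.len() {
        chunks.push(&data[start..]); // trailing partial chunk
    }
    chunks
}

/// Baseline for comparison: boundaries at fixed byte offsets.
fn fixed_chunks(data: &[u8], size: usize) -> Vec<&[u8]> {
    data.chunks(size).collect()
}

/// Count the chunks of `b` whose exact bytes also occur as a chunk of `a`.
fn shared_chunks<'a>(a: &[&'a [u8]], b: &[&'a [u8]]) -> usize {
    let seen: HashSet<&[u8]> = a.iter().copied().collect();
    b.iter().filter(|c| seen.contains(*c)).count()
}

/// Deterministic pseudo-random bytes (xorshift64) standing in for column data.
fn sample_data(len: usize) -> Vec<u8> {
    let mut state: u64 = 0x2545_F491_4F6C_DD1D;
    (0..len)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            (state >> 32) as u8
        })
        .collect()
}
```

   Chunking a buffer from `sample_data(1 << 16)` and a copy with one byte 
inserted near the start, with hypothetical settings such as `min_size = 64`, 
`max_size = 1024`, and `mask = 0xFF00_0000_0000_0000`, should leave most CDC 
chunks shared between the two versions, while 320-byte fixed chunks after the 
insertion point no longer line up. Keeping `min_size` at 64 or more means each 
boundary test sees a full 64-byte hash window, so boundaries depend only on 
nearby content.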
   
   ### Describe the solution you'd like
   
   Expose the Parquet CDC writer options in DataFusion so users can enable the 
feature when writing Parquet files.
   
   This should cover the configuration surface introduced upstream in 
`parquet-rs`, including:
   
   - enabling content-defined chunking with the default settings
   - configuring explicit CDC parameters such as `min_chunk_size`, 
`max_chunk_size`, and `norm_level`
   
   The implementation and rationale are largely derived from 
https://github.com/apache/arrow-rs/pull/9450; this issue tracks bringing those 
changes into DataFusion via 
https://github.com/apache/datafusion/pull/21110.
   
   ### Describe alternatives you've considered
   
   Keep the existing fixed-size Parquet page splitting behavior and leave the 
CDC-related writer options unexposed in DataFusion.
   
   That preserves current behavior, but it means users cannot take advantage of 
the improved page stability and deduplication characteristics now available in 
`parquet-rs`.
   
   ### Additional context
   
   - Tracking PR in DataFusion: https://github.com/apache/datafusion/pull/21110
   - Upstream implementation in arrow-rs: 
https://github.com/apache/arrow-rs/pull/9450
   - Related C++ implementation referenced by the upstream PR: 
https://github.com/apache/arrow/pull/45360
   - Background on the feature: https://huggingface.co/blog/parquet-cdc
   
   Most of the technical content above is intentionally derived from 
https://github.com/apache/arrow-rs/pull/9450, with the additional context that 
this issue tracks the corresponding DataFusion work in 
https://github.com/apache/datafusion/pull/21110.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

