pitrou commented on code in PR #45360:
URL: https://github.com/apache/arrow/pull/45360#discussion_r2086148223
##########
docs/source/python/parquet.rst:
##########
@@ -782,3 +782,63 @@ file decryption properties) is optional and it includes the following options:
 * ``cache_lifetime``, the lifetime of cached entities (key encryption keys,
   local wrapping keys, KMS client objects) represented as a
   ``datetime.timedelta``.
+
+
+Content-Defined Chunking
+------------------------
+
+.. note::
+   This feature is experimental and may change in future releases.
+
+PyArrow introduces an experimental feature for optimizing Parquet files for content
+addressable storage (CAS) systems using content-defined chunking (CDC). This feature
+enables efficient deduplication of data across files, improving network transfers and
+storage efficiency.
+
+When enabled, data pages are written according to content-defined chunk boundaries,
+determined by a rolling hash algorithm that identifies chunk boundaries based on the
+actual content of the data. When data in a column is modified (e.g., inserted, deleted,
+or updated), this approach minimizes the number of changed data pages.
+
+The feature can be enabled by setting the ``use_content_defined_chunking`` parameter in
+the Parquet writer. It accepts either a boolean or a dictionary for configuration:
+
+- ``True``: Uses the default configuration with:
+
+  - Minimum chunk size: 256 KiB
+  - Maximum chunk size: 1024 KiB
+  - Normalization level: 0
+
+- ``dict``: Allows customization of the chunking parameters:
+
+  - ``min_chunk_size``: Minimum chunk size in bytes (default: 256 KiB).
+  - ``max_chunk_size``: Maximum chunk size in bytes (default: 1024 KiB).
+  - ``norm_level``: Normalization level to adjust chunk size distribution (default: 0).
+
+Note that the chunk size is calculated on the logical values before applying any encoding
+or compression. The actual size of the data pages may vary based on the encoding and
+compression used.
+
+.. note::
+   Ensure that Parquet write options remain consistent across writes and files.
+   Using different write options (like compression, encoding, or row group size)
+   for different files may prevent proper deduplication and lead to suboptimal
+   storage efficiency.

Review Comment:
```suggestion
.. note::
   To make the most of this feature, you should ensure that Parquet write options
   remain consistent across writes and files. Using different write options (like
   compression, encoding, or row group size) for different files may prevent proper
   deduplication and lead to suboptimal storage efficiency.
```
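For context, here is a minimal usage sketch of the option documented in this hunk. It is not part of the PR; it assumes the experimental ``use_content_defined_chunking`` keyword is accepted by ``pyarrow.parquet.write_table`` as described above, with the key names and defaults taken from the diff, all of which may change while the feature is experimental.

```python
# Minimal sketch of enabling content-defined chunking (CDC) when writing Parquet.
# Assumes an installed PyArrow build that supports the experimental
# ``use_content_defined_chunking`` keyword described in the diff above.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": list(range(100_000)),
    "value": [f"row-{i}" for i in range(100_000)],
})

# Enable CDC with the documented defaults
# (min chunk size 256 KiB, max chunk size 1024 KiB, normalization level 0).
pq.write_table(table, "data_cdc_default.parquet",
               use_content_defined_chunking=True)

# Enable CDC with explicit chunking parameters.
pq.write_table(
    table,
    "data_cdc_custom.parquet",
    use_content_defined_chunking={
        "min_chunk_size": 128 * 1024,  # bytes, measured on logical values
        "max_chunk_size": 512 * 1024,  # i.e. before encoding/compression
        "norm_level": 1,
    },
)
```

Whether two files written this way actually deduplicate also depends on the downstream CAS system, so the snippet only illustrates the writer-side options, not an end-to-end deduplication guarantee.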