pitrou commented on code in PR #45360:
URL: https://github.com/apache/arrow/pull/45360#discussion_r2086148223
##########
docs/source/python/parquet.rst:
##########
@@ -782,3 +782,63 @@ file decryption properties) is optional and it includes
 the following options:
 * ``cache_lifetime``, the lifetime of cached entities (key encryption keys, local
   wrapping keys, KMS client objects) represented as a ``datetime.timedelta``.
+
+
+Content-Defined Chunking
+------------------------
+
+.. note::
+ This feature is experimental and may change in future releases.
+
+PyArrow introduces an experimental feature for optimizing Parquet files for
+content-addressable storage (CAS) systems using content-defined chunking (CDC).
+This feature enables efficient deduplication of data across files, improving
+network transfers and storage efficiency.
+
+When enabled, data pages are written at content-defined chunk boundaries, which are
+determined by a rolling hash algorithm over the actual content of the data. When data
+in a column is modified (e.g., inserted, deleted, or updated), this approach minimizes
+the number of changed data pages.
+
+The feature can be enabled by setting the ``use_content_defined_chunking`` parameter
+in the Parquet writer. It accepts either a boolean or a dictionary for configuration:
+
+- ``True``: Uses the default configuration with:
+  - Minimum chunk size: 256 KiB
+  - Maximum chunk size: 1024 KiB
+  - Normalization level: 0
+
+- ``dict``: Allows customization of the chunking parameters:
+  - ``min_chunk_size``: Minimum chunk size in bytes (default: 256 KiB).
+  - ``max_chunk_size``: Maximum chunk size in bytes (default: 1024 KiB).
+  - ``norm_level``: Normalization level to adjust chunk size distribution (default: 0).
+
+Note that the chunk size is calculated on the logical values before applying any
+encoding or compression. The actual size of the data pages may vary based on the
+encoding and compression used.
+
+.. note::
+ Ensure that Parquet write options remain consistent across writes and files.
+ Using different write options (like compression, encoding, or row group size)
+ for different files may prevent proper deduplication and lead to suboptimal
+ storage efficiency.
Review Comment:
```suggestion
.. note::
 To make the most of this feature, you should ensure that Parquet write options
 remain consistent across writes and files.
 Using different write options (like compression, encoding, or row group size)
 for different files may prevent proper deduplication and lead to suboptimal
 storage efficiency.
```
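
For illustration, here is a minimal usage sketch of the parameter documented in the diff above (it is not part of the PR). It assumes a PyArrow build that ships this experimental feature and that `pyarrow.parquet.write_table` accepts the `use_content_defined_chunking` keyword as described; the table contents and file names are placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder table; any pyarrow Table is written the same way.
table = pa.table({
    "id": list(range(100_000)),
    "value": [f"row-{i}" for i in range(100_000)],
})

# Boolean form: enable content-defined chunking with the documented defaults
# (min chunk size 256 KiB, max chunk size 1024 KiB, normalization level 0).
pq.write_table(table, "data_cdc_default.parquet",
               use_content_defined_chunking=True)

# Dict form: tune the chunking parameters explicitly. Sizes are in bytes and,
# per the docs above, apply to the logical values before encoding/compression.
pq.write_table(
    table,
    "data_cdc_tuned.parquet",
    use_content_defined_chunking={
        "min_chunk_size": 256 * 1024,
        "max_chunk_size": 1024 * 1024,
        "norm_level": 0,
    },
)
```

In line with the review suggestion, files that are expected to deduplicate against each other would need to be written with the same chunking settings and the same other write options (compression, encoding, row group size).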