pitrou commented on code in PR #45360:
URL: https://github.com/apache/arrow/pull/45360#discussion_r2086148223
##########
docs/source/python/parquet.rst:
##########
@@ -782,3 +782,63 @@ file decryption properties) is optional and it includes the following options:
 * ``cache_lifetime``, the lifetime of cached entities (key encryption keys,
   local wrapping keys, KMS client objects) represented as a
   ``datetime.timedelta``.
+
+
+Content-Defined Chunking
+------------------------
+
+.. note::
+   This feature is experimental and may change in future releases.
+
+PyArrow introduces an experimental feature for optimizing Parquet files for content
+addressable storage (CAS) systems using content-defined chunking (CDC). This feature
+enables efficient deduplication of data across files, improving network transfers and
+storage efficiency.
+
+When enabled, data pages are written according to content-defined chunk boundaries,
+determined by a rolling hash algorithm that identifies chunk boundaries based on the
+actual content of the data. When data in a column is modified (e.g., inserted, deleted,
+or updated), this approach minimizes the number of changed data pages.
+
+The feature can be enabled by setting the ``use_content_defined_chunking`` parameter in
+the Parquet writer. It accepts either a boolean or a dictionary for configuration:
+
+- ``True``: Uses the default configuration with:
+
+  - Minimum chunk size: 256 KiB
+  - Maximum chunk size: 1024 KiB
+  - Normalization level: 0
+
+- ``dict``: Allows customization of the chunking parameters:
+
+  - ``min_chunk_size``: Minimum chunk size in bytes (default: 256 KiB).
+  - ``max_chunk_size``: Maximum chunk size in bytes (default: 1024 KiB).
+  - ``norm_level``: Normalization level to adjust chunk size distribution (default: 0).
+
+Note that the chunk size is calculated on the logical values before applying any encoding
+or compression. The actual size of the data pages may vary based on the encoding and
+compression used.
+
+.. note::
+   Ensure that Parquet write options remain consistent across writes and files.
+   Using different write options (like compression, encoding, or row group size)
+   for different files may prevent proper deduplication and lead to suboptimal
+   storage efficiency.

Review Comment:
```suggestion
.. note::
   To make the most of this feature, you should ensure that Parquet write options
   remain consistent across writes and files. Using different write options (like
   compression, encoding, or row group size) for different files may prevent proper
   deduplication and lead to suboptimal storage efficiency.
```
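For context, here is a minimal usage sketch of the option documented in this hunk. It is not part of the PR; it assumes the experimental ``use_content_defined_chunking`` keyword is accepted by ``pyarrow.parquet.write_table`` as described above, with the key names and defaults taken from the diff, all of which may change while the feature is experimental.

```python
# Minimal sketch of enabling content-defined chunking (CDC) when writing Parquet.
# Assumes an installed PyArrow build that supports the experimental
# ``use_content_defined_chunking`` keyword described in the diff above.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": list(range(100_000)),
    "value": [f"row-{i}" for i in range(100_000)],
})

# Enable CDC with the documented defaults
# (min chunk size 256 KiB, max chunk size 1024 KiB, normalization level 0).
pq.write_table(table, "data_cdc_default.parquet",
               use_content_defined_chunking=True)

# Enable CDC with explicit chunking parameters.
pq.write_table(
    table,
    "data_cdc_custom.parquet",
    use_content_defined_chunking={
        "min_chunk_size": 128 * 1024,  # bytes, measured on logical values
        "max_chunk_size": 512 * 1024,  # i.e. before encoding/compression
        "norm_level": 1,
    },
)
```

Whether two files written this way actually deduplicate also depends on the downstream CAS system, so the snippet only illustrates the writer-side options, not an end-to-end deduplication guarantee.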