kszucs commented on code in PR #45360:
URL: https://github.com/apache/arrow/pull/45360#discussion_r2083108567


##########
cpp/src/parquet/properties.h:
##########
@@ -245,6 +245,34 @@ class PARQUET_EXPORT ColumnProperties {
   bool page_index_enabled_;
 };
 
+// EXPERIMENTAL: Options for content-defined chunking.
+struct PARQUET_EXPORT CdcOptions {
+  /// Minimum chunk size in bytes, default 256 KiB
+  /// The rolling hash will not be updated until this size is reached for each chunk.
+  /// Note that all data sent through the hash function is counted towards the chunk
+  /// size, including definition and repetition levels if present.
+  int64_t min_chunk_size;
+  /// Maximum chunk size in bytes, default is 1024 KiB
+  /// The chunker will create a new chunk whenever the chunk size exceeds this value.
+  /// Note that the parquet writer has a related `pagesize` property that controls
+  /// the maximum size of a parquet data page after encoding. While setting
+  /// `pagesize` to a smaller value than `max_chunk_size` doesn't affect the
+  /// chunking effectiveness, it results in more small parquet data pages.
+  int64_t max_chunk_size;
+  /// Number of bit adjustement to the gearhash mask in order to
+  /// center the chunk size around the average size more aggressively, default 0
+  /// Increasing the normalization factor increases the probability of finding a chunk,
+  /// improving the deduplication ratio, but also increasing the number of small chunks
+  /// resulting in many small parquet data pages. The default value provides a good
+  /// balance between deduplication ratio and fragmentation. Use norm_factor=1 or
+  /// norm_factor=2 to reach a higher deduplication ratio at the expense of
+  /// fragmentation. Negative values can also be used to reduce the probability of
+  /// finding a chunk, resulting in larger chunks and fewer data pages.
+  int norm_factor = 0;

Review Comment:
   Not on `norm_level` itself (I renamed it from `norm_factor`), but on the
calculated number of mask bits, which depends on the `min_chunk_size` and
`max_chunk_size` values; we do raise if the effective number of bits is 0 or 64.
   
   I added a note to the docstring that values outside the `[-3, 3]` range are
not recommended and that the default `norm_level=0` is preferred.
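
To illustrate the kind of check described above, here is a minimal sketch. It is not the code from this PR: the names `SketchGearhashMask` and `FloorLog2` are hypothetical, and the way the bit count is derived from `min_chunk_size` and `max_chunk_size` (log2 of half their distance) is an assumption made only for the example. The only part taken from the comment is that `norm_level` shifts the number of mask bits and that an effective bit count of 0 or 64 is rejected.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: floor(log2(v)) for v > 0, and 0 for v == 0.
inline int FloorLog2(uint64_t v) {
  int bits = 0;
  while (v >>= 1) {
    ++bits;
  }
  return bits;
}

// Hypothetical sketch of deriving a gearhash mask and validating it.
// Assumption: the number of mask bits is derived from half the distance
// between the minimum and maximum chunk sizes; the real formula may differ.
inline uint64_t SketchGearhashMask(int64_t min_chunk_size, int64_t max_chunk_size,
                                   int norm_level) {
  if (min_chunk_size < 0 || max_chunk_size <= min_chunk_size) {
    throw std::invalid_argument("max_chunk_size must be greater than min_chunk_size");
  }
  const int64_t target_size = (max_chunk_size - min_chunk_size) / 2;
  const int mask_bits = FloorLog2(static_cast<uint64_t>(target_size));
  // A higher norm_level means fewer mask bits, so a boundary is found more often.
  const int effective_bits = mask_bits - norm_level;
  // norm_level itself is not range-checked; only the resulting bit count is.
  if (effective_bits <= 0 || effective_bits >= 64) {
    throw std::invalid_argument("effective number of mask bits must be between 1 and 63");
  }
  // Set the `effective_bits` most significant bits of the 64-bit mask.
  return ~uint64_t{0} << (64 - effective_bits);
}
```

Under these assumptions, the defaults of 256 KiB and 1024 KiB give 18 mask bits at `norm_level=0`, so any `norm_level` in the recommended `[-3, 3]` range stays well inside the valid interval, while an extreme value such as 18 would be rejected.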



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
