kszucs commented on code in PR #45360:
URL: https://github.com/apache/arrow/pull/45360#discussion_r2005300075
##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1332,13 +1337,38 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
       bits_buffer_->ZeroPadding();
     }
-    if (leaf_array.type()->id() == ::arrow::Type::DICTIONARY) {
-      return WriteArrowDictionary(def_levels, rep_levels, num_levels, leaf_array, ctx,
-                                  maybe_parent_nulls);
+    if (properties_->cdc_enabled()) {
+      ARROW_ASSIGN_OR_RAISE(auto boundaries,
+                            content_defined_chunker_.GetBoundaries(
Review Comment:
> There are some cases where the parquet writer will further split the input Arrow array into smaller pieces which may affect the precision of the CDC logic here:
>
> * Split the input for max_row_group_size: https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/writer.cc#L458-L470
I am aware of this problem; it does reduce the deduplication efficiency. With multiple row groups, each row group after the modification point gets changed pages at its beginning and end.

A proper solution is to pick a single column from the table and calculate the row group boundaries from that column alone, so that all columns are cut at the same data-dependent record offsets (see the sketch below). In my experiments the efficiency drop caused by fixed-size row group chunking is not significant, though it also depends on the number of pages per row group.

I am planning to experiment with this further after we merge this PR.
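For illustration, a minimal sketch of that idea (not this PR's code; the function name and parameters are made up): the anchor column is chunked with CDC, and a row group is closed at the first chunk boundary after a minimum size, so the cut positions move with the data instead of sitting at a fixed row count.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: derive row group boundaries from the CDC chunk
// lengths of a single "anchor" column. Every other column is then split
// at the same record offsets.
std::vector<int64_t> RowGroupBoundariesFromAnchorColumn(
    const std::vector<int64_t>& cdc_chunk_lengths,  // chunk sizes (in rows) of the anchor column
    int64_t min_rows_per_group) {
  std::vector<int64_t> boundaries;
  int64_t rows_in_group = 0;
  int64_t offset = 0;
  for (int64_t chunk_rows : cdc_chunk_lengths) {
    rows_in_group += chunk_rows;
    offset += chunk_rows;
    // Close the row group at a CDC boundary once it is large enough;
    // the cut therefore shifts with the data rather than with a fixed size.
    if (rows_in_group >= min_rows_per_group) {
      boundaries.push_back(offset);
      rows_in_group = 0;
    }
  }
  if (rows_in_group > 0) boundaries.push_back(offset);  // final partial group
  return boundaries;
}
```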
> * If data page v2 or page index is enabled, page boundary must be a record boundary (i.e. rep_level = 0), this prohibits page cut at certain values: https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc#L1142-L1188
The CDC chunker always cuts at record boundaries (rep_level = 0), so this restriction is already satisfied; a minimal illustration of that alignment follows below.
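A minimal sketch of what that alignment means (not the chunker in this PR; names are hypothetical): a cut position proposed by the rolling hash is deferred to the next leaf value whose repetition level is 0, i.e. the start of a new record.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: move each candidate cut forward to the next
// record boundary (rep_level == 0) so that no page split happens in
// the middle of a record.
std::vector<int64_t> AlignCutsToRecordBoundaries(
    const std::vector<int64_t>& candidate_cuts,  // positions proposed by the rolling hash
    const std::vector<int16_t>& rep_levels) {    // one repetition level per leaf value
  std::vector<int64_t> aligned;
  for (int64_t cut : candidate_cuts) {
    int64_t pos = cut;
    // rep_level == 0 marks the first leaf value of a new record.
    while (pos < static_cast<int64_t>(rep_levels.size()) && rep_levels[pos] != 0) {
      ++pos;
    }
    if (pos < static_cast<int64_t>(rep_levels.size()) &&
        (aligned.empty() || aligned.back() != pos)) {
      aligned.push_back(pos);
    }
  }
  return aligned;
}
```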