kszucs commented on code in PR #45360:
URL: https://github.com/apache/arrow/pull/45360#discussion_r1988133292
########## cpp/src/parquet/column_writer.cc: ##########
@@ -1332,13 +1337,38 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
       bits_buffer_->ZeroPadding();
     }

-    if (leaf_array.type()->id() == ::arrow::Type::DICTIONARY) {
-      return WriteArrowDictionary(def_levels, rep_levels, num_levels, leaf_array, ctx,
-                                  maybe_parent_nulls);
+    if (properties_->cdc_enabled()) {
+      ARROW_ASSIGN_OR_RAISE(auto boundaries,
+                            content_defined_chunker_.GetBoundaries(
+                                def_levels, rep_levels, num_levels, leaf_array));
+      for (auto chunk : boundaries) {
+        auto chunk_array = leaf_array.Slice(chunk.value_offset);
+        auto chunk_def_levels = AddIfNotNull(def_levels, chunk.level_offset);
+        auto chunk_rep_levels = AddIfNotNull(rep_levels, chunk.level_offset);
+        if (leaf_array.type()->id() == ::arrow::Type::DICTIONARY) {

Review Comment:
   Great question. We are dealing with Arrow arrays here, so the Parquet encodings don't interfere with the chunking. However, Arrow's dictionary and REE encodings do matter. Ideally we should feed through the "logical" values rather than the encoded ones, but that would have performance implications, and I am not sure that the path builder would do the record shredding in that case; I assume the path builder only deals with the dictionary indices. Since dictionary changes are less likely, and considering the performance implications, working with the indices should be fine for now. I'd need to do more experimentation on this topic, so I'd prefer to defer it to a follow-up.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org