Re: [PR] GH-45750: [C++][Python][Parquet] Implement Content-Defined Chunking for the Parquet writer [arrow]

via GitHub Sat, 10 May 2025 11:42:53 -0700


kszucs commented on code in PR #45360:
URL: https://github.com/apache/arrow/pull/45360#discussion_r2083265543



##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1337,13 +1368,47 @@ class TypedColumnWriterImpl : public ColumnWriterImpl,
       bits_buffer_->ZeroPadding();
     }
 
-    if (leaf_array.type()->id() == ::arrow::Type::DICTIONARY) {
-      return WriteArrowDictionary(def_levels, rep_levels, num_levels, leaf_array, ctx,
-                                  maybe_parent_nulls);
+    if (properties_->content_defined_chunking_enabled()) {
+      DCHECK(content_defined_chunker_.has_value());
+      auto chunks = content_defined_chunker_->GetChunks(def_levels, rep_levels,

Review Comment:
   I am not sure, maybe? I tend to think that having more than a single chunk 
is more likely, but it depends on how `WriteArrow` is being called. On the other 
hand, there is always a first chunk where `offset` is 0. 
   
   `WriteArrowDense` and `WriteArrowDictionary` also do an unconditional array 
slice, so this optimization (when `value_offset == 0`) could be added there as 
well, though it would be nice to μ-benchmark it first. Could we defer it to a 
follow-up optimization task?
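
For reference, the shape of the optimization being discussed could look roughly like the sketch below. It is not the Arrow implementation: `ArrayView` and `SliceIfNeeded` are hypothetical stand-ins, and it assumes the slice can only be skipped when the chunk starts at offset 0 and also spans the whole array. Whether skipping the slice is measurably faster is exactly what the μ-benchmark would need to show.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an array: a cheap view over shared storage.
struct ArrayView {
  std::shared_ptr<const std::vector<int32_t>> data;
  int64_t offset;
  int64_t length;

  ArrayView(std::shared_ptr<const std::vector<int32_t>> d, int64_t off, int64_t len)
      : data(std::move(d)), offset(off), length(len) {}

  // Always allocates a new view object, like an unconditional slice.
  std::shared_ptr<ArrayView> Slice(int64_t off, int64_t len) const {
    return std::make_shared<ArrayView>(data, offset + off, len);
  }
};

// Hypothetical helper: reuse the input when the requested chunk is the
// whole array, otherwise fall back to a regular slice.
std::shared_ptr<ArrayView> SliceIfNeeded(const std::shared_ptr<ArrayView>& array,
                                         int64_t value_offset, int64_t num_values) {
  if (value_offset == 0 && num_values == array->length) {
    return array;  // no new view object is created
  }
  return array->Slice(value_offset, num_values);
}
```

Note that a first chunk with `value_offset == 0` but fewer values than the array still needs a slice in this sketch, so the fast path only helps when a call produces a single chunk.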



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
