kszucs commented on code in PR #45360:
URL: https://github.com/apache/arrow/pull/45360#discussion_r1988133292
########## cpp/src/parquet/column_writer.cc: ##########
@@ -1332,13 +1337,38 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
       bits_buffer_->ZeroPadding();
     }

-    if (leaf_array.type()->id() == ::arrow::Type::DICTIONARY) {
-      return WriteArrowDictionary(def_levels, rep_levels, num_levels, leaf_array, ctx,
-                                  maybe_parent_nulls);
+    if (properties_->cdc_enabled()) {
+      ARROW_ASSIGN_OR_RAISE(auto boundaries,
+                            content_defined_chunker_.GetBoundaries(
+                                def_levels, rep_levels, num_levels, leaf_array));
+      for (auto chunk : boundaries) {
+        auto chunk_array = leaf_array.Slice(chunk.value_offset);
+        auto chunk_def_levels = AddIfNotNull(def_levels, chunk.level_offset);
+        auto chunk_rep_levels = AddIfNotNull(rep_levels, chunk.level_offset);
+        if (leaf_array.type()->id() == ::arrow::Type::DICTIONARY) {

Review Comment:
   Great question. We are dealing with Arrow arrays here, so the Parquet encodings don't interfere with the chunking. However, Arrow's dictionary and REE encodings do matter. Ideally we should feed through the "logical" values rather than the encoded ones, but that would have performance implications, and I am not sure that the path builder would do the record shredding in that case; I assume the path builder only deals with the dictionary indices. Since dictionary changes are less likely, and considering the performance implications, working with the indices should be fine for now. I'd need to do more experimentation on this topic, so I'd prefer to defer it to a follow-up.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org