kszucs commented on code in PR #45360:
URL: https://github.com/apache/arrow/pull/45360#discussion_r2005300075
##########
cpp/src/parquet/column_writer.cc:
##########
@@ -1332,13 +1337,38 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
       bits_buffer_->ZeroPadding();
     }
-    if (leaf_array.type()->id() == ::arrow::Type::DICTIONARY) {
-      return WriteArrowDictionary(def_levels, rep_levels, num_levels, leaf_array, ctx,
-                                  maybe_parent_nulls);
+    if (properties_->cdc_enabled()) {
+      ARROW_ASSIGN_OR_RAISE(auto boundaries,
+                            content_defined_chunker_.GetBoundaries(
Review Comment:
> There are some cases where the parquet writer will further split the input Arrow array into smaller pieces which may affect the precision of the CDC logic here:
>
> * Split the input for max_row_group_size: https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/writer.cc#L458-L470
I am aware of this problem; it does reduce the deduplication efficiency. With multiple row groups, each row group after the modification point gets changed pages at its beginning and end.

A proper solution is to pick a single column from the table and calculate the row group boundaries from that column alone, so that all columns are cut at the same data-dependent record offsets (see the sketch below). In my experiments the efficiency drop caused by fixed-size row group chunking is not significant, though it also depends on the number of pages per row group.

I am planning to experiment with this further after we merge this PR.
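For illustration, a minimal sketch of that idea (not this PR's code; the function name and parameters are made up): the anchor column is chunked with CDC, and a row group is closed at the first chunk boundary after a minimum size, so the cut positions move with the data instead of sitting at a fixed row count.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: derive row group boundaries from the CDC chunk
// lengths of a single "anchor" column. Every other column is then split
// at the same record offsets.
std::vector<int64_t> RowGroupBoundariesFromAnchorColumn(
    const std::vector<int64_t>& cdc_chunk_lengths,  // chunk sizes (in rows) of the anchor column
    int64_t min_rows_per_group) {
  std::vector<int64_t> boundaries;
  int64_t rows_in_group = 0;
  int64_t offset = 0;
  for (int64_t chunk_rows : cdc_chunk_lengths) {
    rows_in_group += chunk_rows;
    offset += chunk_rows;
    // Close the row group at a CDC boundary once it is large enough;
    // the cut therefore shifts with the data rather than with a fixed size.
    if (rows_in_group >= min_rows_per_group) {
      boundaries.push_back(offset);
      rows_in_group = 0;
    }
  }
  if (rows_in_group > 0) boundaries.push_back(offset);  // final partial group
  return boundaries;
}
```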
> * If data page v2 or page index is enabled, page boundary must be a record boundary (i.e. rep_level = 0), this prohibits page cut at certain values: https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_writer.cc#L1142-L1188
The CDC chunker always cuts at record boundaries (rep_level = 0), so this restriction is already satisfied; a minimal illustration of that alignment follows below.
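A minimal sketch of what that alignment means (not the chunker in this PR; names are hypothetical): a cut position proposed by the rolling hash is deferred to the next leaf value whose repetition level is 0, i.e. the start of a new record.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: move each candidate cut forward to the next
// record boundary (rep_level == 0) so that no page split happens in
// the middle of a record.
std::vector<int64_t> AlignCutsToRecordBoundaries(
    const std::vector<int64_t>& candidate_cuts,  // positions proposed by the rolling hash
    const std::vector<int16_t>& rep_levels) {    // one repetition level per leaf value
  std::vector<int64_t> aligned;
  for (int64_t cut : candidate_cuts) {
    int64_t pos = cut;
    // rep_level == 0 marks the first leaf value of a new record.
    while (pos < static_cast<int64_t>(rep_levels.size()) && rep_levels[pos] != 0) {
      ++pos;
    }
    if (pos < static_cast<int64_t>(rep_levels.size()) &&
        (aligned.empty() || aligned.back() != pos)) {
      aligned.push_back(pos);
    }
  }
  return aligned;
}
```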