[ https://issues.apache.org/jira/browse/ARROW-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526047#comment-17526047 ]
Weston Pace edited comment on ARROW-16240 at 4/21/22 7:42 PM:
--------------------------------------------------------------
Your understanding is correct. I think {{max_rows_per_group}} is the correct
choice here. Each call to {{Write}} (e.g. one go) results in
{noformat}
parquet_writer_->WriteTable(*table, batch->num_rows())
{noformat}
so it will create a new parquet row group.
It might also be useful to set {{min_rows_per_group}} to {{row_group_size}},
but that would be a change in behavior, so maybe we shouldn't do that as well
(the legacy behavior would just write tiny groups in this case).
> [Python] Support row_group_size/chunk_size keyword in pq.write_to_dataset
> with use_legacy_dataset=False
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-16240
> URL: https://issues.apache.org/jira/browse/ARROW-16240
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: Python
> Reporter: Alenka Frim
> Priority: Major
> Fix For: 8.0.0
>
>
> The {{pq.write_to_dataset}} (legacy implementation) supports the
> {{row_group_size}}/{{chunk_size}} keyword to specify the row group size of
> the written parquet files.
> Now that we made {{use_legacy_dataset=False}} the default, this keyword
> doesn't work anymore.
> This is because {{dataset.write_dataset(..)}} doesn't support the parquet
> {{row_group_size}} keyword. The {{ParquetFileWriteOptions}} class doesn't
> support this keyword.
> On the parquet side, this is also the only keyword that is not passed to the
> {{ParquetWriter}} init (and thus to parquet's {{WriterProperties}} or
> {{ArrowWriterProperties}}), but to the actual {{write_table}} call. In C++
> this can be seen at
> https://github.com/apache/arrow/blob/76d064c729f5e2287bf2a2d5e02d1fb192ae5738/cpp/src/parquet/arrow/writer.h#L62-L71
> See discussion:
> [https://github.com/apache/arrow/pull/12811#discussion_r845304218]