[
https://issues.apache.org/jira/browse/ARROW-15981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-15981:
------------------------------------------
Component/s: C++
> [C++][Doc] Better explain chunk_size in Parquet WriteTable api
> --------------------------------------------------------------
>
> Key: ARROW-15981
> URL: https://issues.apache.org/jira/browse/ARROW-15981
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Xinyu Zeng
> Priority: Minor
>
> The Parquet WriteTable API in C++ has a parameter called "chunk_size" that is
> not explained in the documentation. I looked further into the code that uses
> "chunk_size". It seems that it is indeed the size of each chunk written by
> parquet::arrow::WriteColumnChunk. However, a new row group is also created
> each time before WriteColumnChunk is called.
> So in summary, the row group size actually written = min(chunk_size,
> max_row_group_size). When chunk_size < max_row_group_size, the actual row group
> size is equal to chunk_size. This is a little confusing from the user's
> perspective. In the Python binding, row_group_size is passed directly
> to chunk_size. Perhaps the documentation for chunk_size could be expanded to
> describe the behavior above.
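
The rule summarized in the issue can be illustrated with a small standalone model. This is only a sketch of the behavior as the reporter describes it, not code from Arrow: the function name row_group_sizes and its parameters are hypothetical, sizes are assumed to be counted in rows, and the last row group is assumed to hold whatever rows remain.

```python
def row_group_sizes(num_rows, chunk_size, max_row_group_size):
    """Model the row group sizes described in the issue (hypothetical helper).

    Assumption: each chunk starts a new row group, so the effective row
    group size is min(chunk_size, max_row_group_size), counted in rows.
    """
    if chunk_size <= 0 or max_row_group_size <= 0:
        raise ValueError("chunk_size and max_row_group_size must be positive")

    effective = min(chunk_size, max_row_group_size)
    sizes = []
    offset = 0
    while offset < num_rows:
        # The final row group may be smaller than the effective size.
        size = min(effective, num_rows - offset)
        sizes.append(size)
        offset += size
    return sizes


# chunk_size below the limit: chunk_size decides the row group size.
print(row_group_sizes(10, 4, 1_000_000))  # [4, 4, 2]
# chunk_size above the limit: max_row_group_size decides instead.
print(row_group_sizes(10, 100, 3))  # [3, 3, 3, 1]
```

Under this model, a caller who wants larger row groups has to raise chunk_size as well as max_row_group_size, which is the part the issue asks the documentation to spell out.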
--
This message was sent by Atlassian Jira
(v8.20.1#820001)