[ 
https://issues.apache.org/jira/browse/ARROW-15981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-15981:
------------------------------------------
    Component/s: C++

> [C++][Doc] Better explain chunk_size in Parquet WriteTable api
> --------------------------------------------------------------
>
>                 Key: ARROW-15981
>                 URL: https://issues.apache.org/jira/browse/ARROW-15981
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Xinyu Zeng
>            Priority: Minor
>
> The Parquet WriteTable API in C++ has a parameter called "chunk_size" that 
> is not explained in the documentation. I dug into the code related to 
> "chunk_size". It does seem to be the size used for each call to 
> parquet::arrow::WriteColumnChunk, but a new row group is also created 
> each time before WriteColumnChunk is called. 
> So, in summary, the actual written row group size = min(chunk_size, 
> max_row_group_size): when chunk_size < max_row_group_size, the actual row 
> group size equals chunk_size. This is a little confusing from the user's 
> perspective. In the Python binding, row_group_size is in fact passed 
> directly to chunk_size. Perhaps the documentation for chunk_size could 
> describe the behavior above.
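
The behavior summarized in the description can be sketched as a small standalone model. This is only an illustration under the assumptions stated in the issue (sizes counted in rows, one new row group per chunk, the last row group holding the remainder); RowGroupSizes is a hypothetical helper, not an Arrow or Parquet function, and it does not call the real writer.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Arrow): models the row group sizes
// described in the issue, assuming a new row group is started for every
// chunk and that all sizes are measured in rows.
std::vector<int64_t> RowGroupSizes(int64_t num_rows, int64_t chunk_size,
                                   int64_t max_row_group_size) {
  std::vector<int64_t> sizes;
  // Effective row group size = min(chunk_size, max_row_group_size).
  const int64_t effective = std::min(chunk_size, max_row_group_size);
  if (effective <= 0) {
    return sizes;  // Nothing sensible to model for non-positive sizes.
  }
  for (int64_t offset = 0; offset < num_rows; offset += effective) {
    // The final row group holds whatever rows remain.
    sizes.push_back(std::min(effective, num_rows - offset));
  }
  return sizes;
}
```

Under this model, RowGroupSizes(10, 4, 1000) yields row groups of 4, 4 and 2 rows, while RowGroupSizes(10, 1000, 4) yields the same split because max_row_group_size becomes the limiting value.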



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
