Xinyu Zeng created ARROW-15981:
----------------------------------

             Summary: [C++][Doc] Better explain chunk_size in Parquet WriteTable API
                 Key: ARROW-15981
                 URL: https://issues.apache.org/jira/browse/ARROW-15981
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Xinyu Zeng


The Parquet WriteTable API in C++ has a parameter called "chunk_size" that is
not explained in the documentation. I looked further into the code related to
"chunk_size". It seems that it is indeed the size of each chunk written by
parquet::arrow::WriteColumnChunk. However, a new row group is also created
each time before WriteColumnChunk is called.

So in summary, the actual size of each written row group = min(chunk_size,
max_row_group_size). When chunk_size < max_row_group_size, the actual row group
size is equal to chunk_size. This is a little confusing from the user's
perspective. In the Python binding, row_group_size is passed directly to
chunk_size. Perhaps the documentation for chunk_size should describe the
behavior above.
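
To make the behavior above concrete, here is a minimal sketch that models how
the rows of a table would be split into row groups. It is not Arrow code:
RowGroupSizes is a hypothetical helper written only for illustration, it uses
just the C++ standard library, and it assumes the last row group simply holds
the remaining rows. Zero-row tables are not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the Arrow or Parquet API. It models the
// row group sizes that result from the behavior described in this issue.
std::vector<int64_t> RowGroupSizes(int64_t num_rows, int64_t chunk_size,
                                   int64_t max_row_group_size) {
  assert(chunk_size > 0 && max_row_group_size > 0);
  // The effective row group size is capped by max_row_group_size.
  const int64_t effective = std::min(chunk_size, max_row_group_size);
  std::vector<int64_t> sizes;
  for (int64_t offset = 0; offset < num_rows; offset += effective) {
    // Assumption: the last row group holds whatever rows remain.
    sizes.push_back(std::min(effective, num_rows - offset));
  }
  return sizes;
}
```

Under this model, a 10-row table with chunk_size = 4 and a larger
max_row_group_size yields row groups of 4, 4, and 2 rows, while chunk_size = 100
with max_row_group_size = 3 yields 3, 3, 3, and 1.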


