westonpace commented on code in PR #34435:
URL: https://github.com/apache/arrow/pull/34435#discussion_r1224462453
##########
python/pyarrow/_parquet.pyx:
##########
@@ -1597,6 +1597,15 @@ cdef shared_ptr[WriterProperties] _create_writer_properties(
         props.encryption(
             (<FileEncryptionProperties>encryption_properties).unwrap())
+    # For backwards compatibility reasons we cap the maximum row group size
+    # at 64Mi rows. This could be changed in the future, though it would be
+    # a breaking change.
+    #
+    # The user can always specify a smaller row group size (and the default
+    # is smaller) when calling write_table. If the call to write_table uses
+    # a size larger than this then it will be latched to this value.
+    props.max_row_group_length(64*1024*1024)
Review Comment:
Sort of. There are two properties, which I think is what causes the
confusion:
`parquet::arrow::WriteTable::chunk_size` and
`parquet::WriterProperties::max_row_group_length`.
I think it is ok for `max_row_group_length` to always be 64Mi; that is what
the comment is justifying. However, we now need to change pyarrow's default
`chunk_size` to 1Mi (the default in C++). Currently pyarrow uses
`table.num_rows()` as the default.
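
To make the interaction concrete, here is a minimal sketch assuming the
proposed 1Mi default for `chunk_size` and the 64Mi cap above. The helper
`effective_row_group_size` is hypothetical, purely for illustration; only
`pq.write_table` and its `row_group_size` parameter are real pyarrow API:

```python
import pyarrow as pa
import pyarrow.parquet as pq

MAX_ROW_GROUP_LENGTH = 64 * 1024 * 1024  # cap set via WriterProperties
DEFAULT_CHUNK_SIZE = 1024 * 1024         # proposed default, matching C++

def effective_row_group_size(requested, num_rows):
    # Hypothetical helper: chunk_size controls how write_table slices the
    # table into row groups, while max_row_group_length silently latches
    # any larger request down to the cap.
    if requested is None:
        requested = DEFAULT_CHUNK_SIZE  # rather than num_rows, the old default
    return min(requested, MAX_ROW_GROUP_LENGTH, num_rows)

table = pa.table({"x": list(range(10))})
# row_group_size here corresponds to parquet::arrow::WriteTable::chunk_size
pq.write_table(table, "example.parquet", row_group_size=5)
```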