jorisvandenbossche commented on issue #34374:
URL: https://github.com/apache/arrow/issues/34374#issuecomment-1449913024

   Something else I noticed while comparing different row group sizes: 
specifying the row group size no longer seems to work for getting anything 
other than the default.
   
   Using released pyarrow 11:
   
   ```python
   >>> import pyarrow as pa
   >>> pa.__version__
   '11.0.0'
   >>> import pyarrow.parquet as pq
   # table of 9_980_000 (~10M) rows
   >>> table = pq.read_table("nyc_taxi_sample.parquet").combine_chunks()
   # default is still 64M -> 1 row group
   >>> pq.write_table(table, "test_pa11_row_group_size_None.parquet")
   >>> pq.read_metadata("test_pa11_row_group_size_None.parquet").num_row_groups
   1
   # specify the new default manually -> correctly get 10 row groups
   >>> pq.write_table(table, "test_pa11_row_group_size_1M.parquet", row_group_size=1024*1024)
   >>> pq.read_metadata("test_pa11_row_group_size_1M.parquet").num_row_groups
   10
   # specify old default
   >>> pq.write_table(table, "test_pa11_row_group_size_64M.parquet", row_group_size=1024*1024*64)
   >>> pq.read_metadata("test_pa11_row_group_size_64M.parquet").num_row_groups
   1
   ```
   
   With the latest main, however:
   
   ```python
   >>> import pyarrow as pa
   >>> pa.__version__
   '12.0.0.dev171+gf9a1d198f.d20230301'
   >>> import pyarrow.parquet as pq
   >>> table = pq.read_table("nyc_taxi_sample.parquet").combine_chunks()
   # default now gives 10 row groups
   >>> pq.write_table(table, "test_pa12_row_group_size_None.parquet")
   >>> pq.read_metadata("test_pa12_row_group_size_None.parquet").num_row_groups
   10
   # specifying old default of 64M -> still gives 10 and not 1 row group
   >>> pq.write_table(table, "test_pa12_row_group_size_64M.parquet", row_group_size=1024*1024*64)
   >>> pq.read_metadata("test_pa12_row_group_size_64M.parquet").num_row_groups
   10
   ```
   
   I don't see how this could have been caused by 
https://github.com/apache/arrow/pull/34281, but it is certainly something we 
should fix as well.
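
As a sanity check on the row group counts expected in the sessions above, here is a minimal sketch of the arithmetic. It assumes the writer splits a single-chunk table into row groups of at most `row_group_size` rows, so the count is just a ceiling division. The helper `expected_row_groups` is hypothetical and not part of pyarrow.

```python
def expected_row_groups(num_rows: int, row_group_size: int) -> int:
    """Number of row groups if a table is cut into pieces of at most
    `row_group_size` rows (assumption: single chunk, no other limits)."""
    if num_rows <= 0 or row_group_size <= 0:
        raise ValueError("num_rows and row_group_size must be positive")
    # Integer ceiling division, avoids floating point rounding.
    return -(-num_rows // row_group_size)


NUM_ROWS = 9_980_000  # size of the sample table used above

# New default of 1Mi rows per row group
print(expected_row_groups(NUM_ROWS, 1024 * 1024))       # 10
# Old default of 64Mi rows per row group
print(expected_row_groups(NUM_ROWS, 1024 * 1024 * 64))  # 1
```

Under that assumption, passing `row_group_size=1024*1024*64` should give 1 row group, which matches the pyarrow 11 result and not the 10 row groups seen on main.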


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
