[PR] Implement write.parquet.row-group-size-bytes in the pyarrow writer [iceberg-python]

via GitHub Mon, 01 Jun 2026 15:35:19 -0700


stephrb opened a new pull request, #3449:
URL: https://github.com/apache/iceberg-python/pull/3449


   The pyiceberg writer has historically ignored
   write.parquet.row-group-size-bytes (logging 'not implemented') and sized
   row groups using only write.parquet.row-group-limit, which counts rows.
   For wide tables, a single row group can therefore reach gigabytes: for
   example, 337 columns × 1,048,576 rows (the default limit) ≈ 1.7 GiB
   uncompressed per row group. That pushes the polars / pyarrow reader's
   peak decode memory into the tens of GiB on production reads.
   
   write_file now resolves row_group_size as
   min(row_group_limit, row_group_size_bytes / bytes_per_row), where
   bytes_per_row is approximated from the nbytes of the in-memory
   arrow_table. This matches the Spark / parquet-mr semantics of 'whichever
   limit fires first' and lets the existing
   PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT (128 MiB) actually take effect.
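
   The resolution rule can be sketched in plain Python as follows. This is
   an illustration of the formula only, not the code in the PR: the function
   name resolve_row_group_size, its parameters, and the handling of empty
   tables are assumptions made for this example. In the writer, num_rows and
   nbytes would come from the in-memory Arrow table.

```python
# Hypothetical helper illustrating the formula in the description; the name,
# signature, and edge-case handling are assumptions, not pyiceberg's API.

ROW_GROUP_LIMIT_DEFAULT = 1_048_576              # rows, default cited above
ROW_GROUP_SIZE_BYTES_DEFAULT = 128 * 1024 * 1024  # 128 MiB, cited above


def resolve_row_group_size(
    num_rows: int,
    nbytes: int,
    row_group_limit: int = ROW_GROUP_LIMIT_DEFAULT,
    row_group_size_bytes: int = ROW_GROUP_SIZE_BYTES_DEFAULT,
) -> int:
    """Return the number of rows per row group, whichever limit fires first."""
    if num_rows <= 0 or nbytes <= 0:
        # Nothing to measure, so fall back to the row-count limit (assumption).
        return row_group_limit
    # Approximate the uncompressed width of one row from the in-memory size.
    bytes_per_row = nbytes / num_rows
    # Rows that fit in the byte budget; keep at least one row per group.
    rows_for_byte_target = max(1, int(row_group_size_bytes / bytes_per_row))
    return min(row_group_limit, rows_for_byte_target)


# Wide table from the description: about 1.7 GiB across 1,048,576 rows.
wide = resolve_row_group_size(num_rows=1_048_576, nbytes=int(1.7 * 1024**3))

# Narrow table: 8 bytes per row, so the row-count limit fires first.
narrow = resolve_row_group_size(num_rows=1_000, nbytes=8_000)

print(wide, narrow)
```

   For the wide table the byte budget is the binding limit, giving row
   groups of roughly 77,000 rows instead of 1,048,576. For the narrow table
   the row-count limit still applies.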
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


