Satyr09 opened a new issue, #22650: URL: https://github.com/apache/datafusion/issues/22650
### Is your feature request related to a problem or challenge?

DataFusion's Parquet writer exposes only a row-count limit for row group sizing, via `ParquetOptions.max_row_group_size` (`datafusion.execution.parquet.max_row_group_size`, default 1M rows). There is no way to bound a row group by bytes.

Row count can be a poor proxy for row group size, because bytes per row vary widely with schema width. The same `max_row_group_size = 1M` yields a small row group for a narrow schema and a multi-hundred-MB row group for a wide one.

### Describe the solution you'd like

Add an optional `max_row_group_bytes` to `ParquetOptions`, wired to `WriterPropertiesBuilder::set_max_row_group_bytes`.

### Describe alternatives you've considered

_No response_

### Additional context

The capability is already available on DataFusion main, so no dependency bump is required. I have an implementation ready (config field, `WriterPropertiesBuilder` wiring, round-trip tests, and docs) and can open a PR against this issue.
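To make the row-count versus byte-size mismatch concrete, here is a rough back-of-the-envelope sketch using only the standard library. It is not the Parquet writer's actual logic: the function names, the average bytes-per-row figures, and the 128 MiB cap are all hypothetical values chosen for illustration, and a real writer would measure buffered or encoded sizes rather than multiply by an average.

```rust
/// Estimated uncompressed size in bytes of a row group holding `rows` rows,
/// assuming a fixed average row width (a simplification for illustration).
fn estimated_row_group_bytes(rows: u64, avg_bytes_per_row: u64) -> u64 {
    rows.saturating_mul(avg_bytes_per_row)
}

/// Rows per row group when a row limit and an optional byte limit both apply.
/// The byte limit is converted to a row budget; at least one row is always allowed.
fn rows_per_group(max_rows: u64, max_bytes: Option<u64>, avg_bytes_per_row: u64) -> u64 {
    match max_bytes {
        Some(limit) if avg_bytes_per_row > 0 => {
            let byte_budget_rows = (limit / avg_bytes_per_row).max(1);
            max_rows.min(byte_budget_rows)
        }
        _ => max_rows,
    }
}

fn main() {
    // Round figure standing in for the "1M rows" default mentioned above.
    let max_rows: u64 = 1_000_000;
    // Hypothetical average row widths for a narrow and a wide schema.
    let narrow: u64 = 16;
    let wide: u64 = 512;
    // Hypothetical byte cap of 128 MiB.
    let cap: u64 = 128 * 1024 * 1024;

    for (name, width) in [("narrow", narrow), ("wide", wide)] {
        let uncapped = estimated_row_group_bytes(max_rows, width);
        let capped_rows = rows_per_group(max_rows, Some(cap), width);
        let capped = estimated_row_group_bytes(capped_rows, width);
        println!(
            "{name}: rows-only limit => {uncapped} bytes; with byte cap => {capped_rows} rows, {capped} bytes"
        );
    }
}
```

Under these assumed widths, the row-only limit leaves the narrow schema at roughly 16 MB per row group while the wide schema reaches roughly 512 MB; adding the byte cap leaves the narrow case unchanged and trims the wide case to fit the cap.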
