pan3793 opened a new issue, #3235: URL: https://github.com/apache/parquet-java/issues/3235
### Describe the enhancement requested

In some compute engines, such as Spark, the row group is the smallest splittable unit for scan tasks. Currently, only the row group byte size is configurable (via `parquet.block.size`) when writing Parquet files. In some cases, especially for tables with few columns and many duplicated values, a single row group can accumulate an enormous number of records, which causes extremely poor performance in downstream Spark queries.

<img width="1710" alt="Image" src="https://github.com/user-attachments/assets/09d29577-c674-4405-a84e-bb30d7107b06" />

I propose making the row count limit for each row group configurable; a sketch of where such a knob would sit is included at the end of this issue. [ORC-1172](https://issues.apache.org/jira/browse/ORC-1172) added a similar configuration to ORC.

### Component(s)

Core
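To make the proposal concrete, here is a minimal sketch using parquet-java's Avro bindings. The `ParquetWriter.Builder#withRowGroupSize(long)` call shown is the existing byte-based knob; the `parquet.block.row.count.limit` key in the comment is hypothetical, naming the kind of setting this issue asks for, and does not exist in parquet-java today.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class RowGroupLimitSketch {
  public static void main(String[] args) throws Exception {
    // Single low-cardinality long column: the worst case described above,
    // where a 128 MiB row group can hold a very large number of rows.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":"
            + "[{\"name\":\"id\",\"type\":\"long\"}]}");

    Configuration conf = new Configuration();
    // Hypothetical key illustrating the proposal: flush the current row
    // group once it reaches N rows, even if it is still under the byte
    // target. Not a real parquet-java option today.
    // conf.setInt("parquet.block.row.count.limit", 1_000_000);

    try (ParquetWriter<GenericRecord> writer =
        AvroParquetWriter.<GenericRecord>builder(new Path("file:/tmp/example.parquet"))
            .withSchema(schema)
            .withConf(conf)
            // Existing knob: target row group size in bytes only. There is
            // currently no builder method to cap the row count per row group.
            .withRowGroupSize(128L * 1024 * 1024)
            .build()) {
      GenericRecord rec = new GenericData.Record(schema);
      rec.put("id", 1L);
      writer.write(rec);
    }
  }
}
```

A natural precedent inside parquet-java itself is `parquet.page.row.count.limit`, which already caps the number of rows per page; the request here is the analogous limit one level up, at the row group boundary.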
