pan3793 opened a new issue, #3235:
URL: https://github.com/apache/parquet-java/issues/3235

   ### Describe the enhancement requested
   
   In some compute engines, such as Spark, a row group is the smallest splittable unit for scan tasks. Currently, only the row group size in bytes is configurable (via `parquet.block.size`) when writing Parquet files. In some cases, especially for tables with few columns and many duplicated values, a single row group can hold an enormous number of records, which causes extremely poor performance in downstream Spark queries.
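   For context, this is roughly how the byte-based threshold is configured today through the parquet-java writer builder (a minimal sketch; the schema and output path are placeholders):
   
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class RowGroupSizeExample {
  public static void main(String[] args) throws Exception {
    // A narrow schema with highly repetitive values compresses so well that a
    // single 128 MiB row group can end up holding a huge number of rows.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message example { required int64 id; required binary tag (UTF8); }");

    try (ParquetWriter<Group> writer =
        ExampleParquetWriter.builder(new Path("/tmp/example.parquet"))
            .withType(schema)
            .withConf(new Configuration())
            // The only row-group knob today: a byte threshold (parquet.block.size).
            .withRowGroupSize(128L * 1024 * 1024)
            .build()) {
      // ... write Group records ...
    }
  }
}
```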
   
   <img width="1710" alt="Image" 
src="https://github.com/user-attachments/assets/09d29577-c674-4405-a84e-bb30d7107b06";
 />
   
   
   I propose making the row count limit for each row group configurable. [ORC-1172](https://issues.apache.org/jira/browse/ORC-1172) added a similar configuration to ORC.
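   
   A rough sketch of what the new knob could look like, mirroring the existing page-level limit (`parquet.page.row.count.limit` / `withPageRowCountLimit`); the property name `parquet.block.row.count.limit` below is hypothetical and only illustrates the idea:
   
```java
import org.apache.hadoop.conf.Configuration;

public class RowGroupRowCountSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Existing byte threshold for row groups.
    conf.setLong("parquet.block.size", 128L * 1024 * 1024);
    // Proposed (hypothetical name, not read by parquet-java today): cap the
    // number of rows per row group regardless of its encoded size, so the
    // writer flushes the row group when either limit is hit first.
    conf.setLong("parquet.block.row.count.limit", 1_000_000L);
  }
}
```
   
   An analogous `ParquetWriter.Builder` method next to the existing `withPageRowCountLimit` would keep the writer API consistent.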
   
   ### Component(s)
   
   Core

