[ https://issues.apache.org/jira/browse/PARQUET-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16653995#comment-16653995 ]
Ryan Blue commented on PARQUET-1414:
------------------------------------

[~gszadovszky], can you add a link to your benchmarks to this issue? I think the conclusion we came to while discussing was between 10k and 20k, with 20k being the better choice for overall file size. Is 20k the planned default now?

> Limit page size based on maximum row count
> ------------------------------------------
>
>                 Key: PARQUET-1414
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1414
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>             Fix For: 1.11.0
>
>
> For column index based filtering it is important to have enough pages for a
> column. In case of a perfectly matching encoding for the suitable data it can
> happen that all of the values can be encoded in one page (e.g. a column of an
> ascending counter).
> With this improvement we would be able to limit the pages by the maximum
> number of rows to be written in it so we would have enough pages for every
> column. A good default value should be benchmarked. For initial, we can use
> 10k.
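
For readers following the issue, the effect described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Parquet's writer code: the function name `split_into_pages`, the 0.1 bytes-per-row figure for a well-compressed ascending counter, and the 1 MiB page size are all hypothetical values chosen for illustration. It shows how a size-only threshold can leave a whole column in one page, while an added row count limit forces more pages.

```python
from typing import List, Optional


def split_into_pages(
    num_rows: int,
    bytes_per_row: float,
    page_size_bytes: int,
    page_row_count_limit: Optional[int] = None,
) -> List[int]:
    """Toy model: return the number of rows that end up in each page.

    A page is closed when its encoded size reaches page_size_bytes or,
    if a limit is given, when it holds page_row_count_limit rows.
    All parameters are illustrative assumptions, not Parquet internals.
    """
    pages: List[int] = []
    rows_in_page = 0
    bytes_in_page = 0.0
    for _ in range(num_rows):
        rows_in_page += 1
        bytes_in_page += bytes_per_row
        size_reached = bytes_in_page >= page_size_bytes
        rows_reached = (
            page_row_count_limit is not None
            and rows_in_page >= page_row_count_limit
        )
        if size_reached or rows_reached:
            pages.append(rows_in_page)
            rows_in_page = 0
            bytes_in_page = 0.0
    if rows_in_page:
        # Flush the final, partially filled page.
        pages.append(rows_in_page)
    return pages


# Hypothetical column: 1,000,000 rows of an ascending counter that
# encodes to roughly 0.1 bytes per row, with a 1 MiB page size.
size_only = split_into_pages(1_000_000, 0.1, 1024 * 1024)
with_limit = split_into_pages(1_000_000, 0.1, 1024 * 1024, page_row_count_limit=20_000)
print(len(size_only), len(with_limit))
```

In this model the size-only run never reaches the byte threshold, so every row lands in a single page and a column index has nothing to skip. With the 20k limit discussed in this thread, the same column is split into evenly sized pages that can each be filtered independently.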