[ 
https://issues.apache.org/jira/browse/PARQUET-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-1414:
--------------------------------------
    Description: 
For column-index-based filtering it is important that each column has enough 
pages. When the encoding matches the data very well, all of the values of a 
column can end up encoded in a single page (e.g. a column holding an ascending 
counter).

With this improvement we would be able to limit the size of a page by the 
maximum number of rows written to it, so that every column has enough pages.

Based on the benchmarks listed 
[here|https://docs.google.com/spreadsheets/d/1hfQPy8NkGbgGugnHWvIHSzZ-3Q5M7f3Dtf_oD9ACFRg]
 20k seems to be a good choice for the default value.
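
The page-cutting rule described above can be illustrated with a minimal, 
self-contained sketch. It is not the parquet-mr implementation: the names 
split_into_pages and toy_delta_size, the 1 MiB byte limit and the toy size 
model (8 bytes plus roughly one bit per value) are all assumptions made for 
illustration. Only the 20k row limit comes from this issue.

```python
from typing import Callable, List, Sequence


def split_into_pages(
    values: Sequence[int],
    max_page_bytes: int,
    max_page_rows: int,
    encoded_size: Callable[[Sequence[int]], int],
) -> List[List[int]]:
    """Cut a column into pages when either the byte or the row limit is hit.

    Hypothetical helper, not a parquet-mr API.
    """
    pages: List[List[int]] = []
    current: List[int] = []
    for value in values:
        current.append(value)
        # Close the page as soon as either limit is reached.
        if (len(current) >= max_page_rows
                or encoded_size(current) >= max_page_bytes):
            pages.append(current)
            current = []
    if current:
        # Flush the last, partially filled page.
        pages.append(current)
    return pages


def toy_delta_size(page: Sequence[int]) -> int:
    """Assumed size model for a well-compressing encoding of a counter:
    8 bytes for the first value plus about one bit per value."""
    return 8 + len(page) // 8


counter = list(range(100_000))  # a column of an ascending counter
one_mib = 1024 * 1024

# Byte limit only (row limit effectively disabled): everything fits in one page.
print(len(split_into_pages(counter, one_mib, 10**9, toy_delta_size)))   # 1

# Byte limit plus a 20k row limit: the column is cut into five pages.
print(len(split_into_pages(counter, one_mib, 20_000, toy_delta_size)))  # 5
```

With the byte limit alone the whole counter column lands in one page, so a 
column index would have a single min/max entry to filter on. Adding the row 
limit yields five pages, each with its own value range.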

  was:
For column-index-based filtering it is important that each column has enough 
pages. When the encoding matches the data very well, all of the values of a 
column can end up encoded in a single page (e.g. a column holding an ascending 
counter).

With this improvement we would be able to limit the size of a page by the 
maximum number of rows written to it, so that every column has enough pages. A 
good default value should be benchmarked. As an initial value, we can use 10k.


> Limit page size based on maximum row count
> ------------------------------------------
>
>                 Key: PARQUET-1414
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1414
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>             Fix For: 1.11.0
>
>
> For column-index-based filtering it is important that each column has enough 
> pages. When the encoding matches the data very well, all of the values of a 
> column can end up encoded in a single page (e.g. a column holding an 
> ascending counter).
> With this improvement we would be able to limit the size of a page by the 
> maximum number of rows written to it, so that every column has enough pages.
> Based on the benchmarks listed 
> [here|https://docs.google.com/spreadsheets/d/1hfQPy8NkGbgGugnHWvIHSzZ-3Q5M7f3Dtf_oD9ACFRg]
>  20k seems to be a good choice for the default value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
