[
https://issues.apache.org/jira/browse/PARQUET-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zoltan Ivanfi reassigned PARQUET-1337:
--------------------------------------
Assignee: Zoltan Ivanfi
> Implement better estimate of page size for RLE+bitpacking
> ---------------------------------------------------------
>
> Key: PARQUET-1337
> URL: https://issues.apache.org/jira/browse/PARQUET-1337
> Project: Parquet
> Issue Type: Improvement
> Reporter: Gabor Szadovszky
> Assignee: Zoltan Ivanfi
> Priority: Major
>
> If there are many columns with RLE+bitpacking encoding (e.g. dictionary
> encoding) where the value variance is low, the estimated size of the
> open pages (which are not encoded yet) is much larger than the final page
> size. Because of that, parquet-mr fails to create row groups whose sizes are
> close to {{parquet.block.size}}, which causes performance issues while reading.
> A hint from Ryan to solve this issue:
> {quote}
> We could probably get a better estimate by using the amount of buffered
> data and how large other pages in a column were after fully encoding and
> compressing. So if you have 5 pages compressed and buffered, and another
> 1000 values, use the compression ratio of the 5 pages to estimate the final
> size. We'd probably want to use some overhead value for the header. And,
> we'd want to separate the amount of buffered data from our row group size
> estimate, which are currently the same thing.
> {quote}
> (So, it is not only about RLE+bitpacking but any kind of encoding which is
> done only after "closing" a page.)
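
One possible reading of the hint above, as a minimal sketch rather than actual parquet-mr code: track the encoded and compressed size of the pages already finished for a column, derive an average number of bytes per value from them, and apply that to the values still buffered in the open page. The class name, method names and the header overhead of 32 bytes are all hypothetical, chosen only for illustration. The raw buffered size is passed in separately so that it stays distinct from the row group size estimate.

```java
/** Hypothetical per-column estimator; not part of parquet-mr. */
final class PageSizeEstimator {
    // Assumed fixed overhead for the header of the still-open page (illustrative value).
    private static final long OPEN_PAGE_HEADER_OVERHEAD = 32;

    private long finishedValues = 0;        // values in pages that are already encoded
    private long finishedEncodedBytes = 0;  // their size after encoding and compression

    /** Record a page once it has been closed, encoded and compressed. */
    void pageFinished(long valueCount, long encodedBytes) {
        finishedValues += valueCount;
        finishedEncodedBytes += encodedBytes;
    }

    /**
     * Estimate the column's contribution to the final row group size.
     *
     * @param openValues   number of values buffered in the open page
     * @param openRawBytes current (unencoded) buffer size of the open page
     */
    long estimateRowGroupBytes(long openValues, long openRawBytes) {
        if (openValues == 0) {
            return finishedEncodedBytes;
        }
        long openEstimate;
        if (finishedValues == 0) {
            // No history yet: fall back to the raw buffered size.
            openEstimate = openRawBytes;
        } else {
            // Scale the open page by the bytes-per-value seen in finished pages.
            double bytesPerValue = (double) finishedEncodedBytes / finishedValues;
            openEstimate = Math.round(openValues * bytesPerValue);
        }
        return finishedEncodedBytes + openEstimate + OPEN_PAGE_HEADER_OVERHEAD;
    }
}
```

With five finished pages of 1000 values and 200 encoded bytes each, plus 1000 buffered values occupying 4000 raw bytes, this sketch estimates 1000 + 200 + 32 bytes instead of counting the full 4000 raw bytes for the open page. Before the first page is closed there is no ratio to use, so it falls back to the raw buffered size.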
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)