[ https://issues.apache.org/jira/browse/PARQUET-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zoltan Ivanfi updated PARQUET-1337:
-----------------------------------
    Summary: Current block alignment logic may lead to several row groups per block  (was: Implement better estimate of page size for RLE+bitpacking)

> Current block alignment logic may lead to several row groups per block
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-1337
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1337
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Gabor Szadovszky
>            Assignee: Zoltan Ivanfi
>            Priority: Major
>
> If there are many columns with encoding RLE+bitpacking (e.g. dictionary
> encoding) where the value variance is low, the estimate of the size of the
> open pages (which are not encoded yet) is much larger than the final page
> size. Because of that, parquet-mr fails to create row groups whose size is
> close to {{parquet.block.size}}, which causes performance issues while reading.
> A hint from Ryan to solve this issue:
> {quote}
> We could probably get a better estimate by using the amount of buffered
> data and how large other pages in a column were after fully encoding and
> compressing. So if you have 5 pages compressed and buffered, and another
> 1000 values, use the compression ratio of the 5 pages to estimate the final
> size. We'd probably want to use some overhead value for the header. And,
> we'd want to separate the amount of buffered data from our row group size
> estimate, which are currently the same thing.
> {quote}
> (So, it is not only about RLE+bitpacking but any kind of encoding which is
> done only after "closing" a page.)
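
The arithmetic behind the quoted hint can be sketched as follows. This is only an illustration in Python, not parquet-mr code: the names `FinishedPage` and `estimate_column_size` and the 64-byte header overhead are assumptions made up for the example. It applies the compression ratio observed on a column's already finished pages to the raw bytes still buffered in the open page.

```python
from dataclasses import dataclass

# Assumed per-page header cost; a placeholder, not a parquet-mr constant.
PAGE_HEADER_OVERHEAD_BYTES = 64


@dataclass
class FinishedPage:
    raw_bytes: int         # size of the values before encoding and compression
    compressed_bytes: int  # size of the page after encoding and compression


def estimate_column_size(finished_pages, open_page_raw_bytes):
    """Estimate the final size of one column chunk in bytes.

    Finished pages count at their real size. The open page is scaled by
    the ratio observed on the finished pages, plus a header overhead.
    """
    finished_total = sum(p.compressed_bytes for p in finished_pages)
    raw_total = sum(p.raw_bytes for p in finished_pages)

    if open_page_raw_bytes <= 0:
        return float(finished_total)

    if raw_total > 0:
        # Scale the buffered bytes by the observed compressed/raw ratio.
        open_estimate = open_page_raw_bytes * finished_total / raw_total
    else:
        # No finished page yet: nothing to learn from, keep the raw size.
        open_estimate = float(open_page_raw_bytes)

    return finished_total + open_estimate + PAGE_HEADER_OVERHEAD_BYTES
```

With five finished pages of 1000 raw bytes that each shrank to 100 bytes, and 4000 raw bytes in the open page, the sketch gives 500 + 400 + 64 = 964 bytes, whereas counting the open page at its raw size would give 4500. The amount of memory actually buffered would be tracked separately from this estimate, as the hint suggests.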