[
https://issues.apache.org/jira/browse/PARQUET-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zoltan Ivanfi updated PARQUET-1337:
-----------------------------------
Description:
When the size of the buffered data gets near the desired row group size,
Parquet flushes the data to a row group. However, at this point the data of the
last page is not yet encoded or compressed, so the row group may end up
significantly smaller than intended.
If the row group ends up being so small that it is farther away from the next
disk block boundary than the maximum padding, Parquet will try to create a new
row group in the same disk block, this time targeting the remaining space. This
may also be flushed prematurely, leading to the creation of an even smaller row
group, which may in turn lead to an even smaller one, and so on. This repeats
until the write position gets close enough to the block boundary for padding to
be applied. The resulting superfluous row groups can lead to bad performance.
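The following self-contained sketch models this feedback loop. It is
illustrative only: the class name, the constants and the fixed "shortfall"
ratio are assumptions made for the demo, not values or code taken from
parquet-mr.
{code:java}
// Illustrative model of the shrinking-row-group loop described above.
public class RowGroupShrinkDemo {
  public static void main(String[] args) {
    long blockSize = 10_000_000L; // disk block size = desired row group size
    long maxPadding = 1_000_000L; // assumed maximum padding
    double shortfall = 0.34;      // assumed fraction "lost" because the last
                                  // page is still unencoded when we flush
    long offset = 0;
    while (blockSize - offset > maxPadding) {
      // Each new row group targets whatever is left of the disk block...
      long target = blockSize - offset;
      // ...but the premature flush makes it smaller than the target.
      long actual = Math.round(target * (1 - shortfall));
      System.out.printf("row group at offset %,9d size %,9d%n", offset, actual);
      offset += actual;
    }
    System.out.printf("padding %,d bytes up to the block boundary%n",
        blockSize - offset);
  }
}
{code}
With these assumed numbers the loop emits three progressively smaller row
groups before padding becomes possible, the same pattern the dump below shows
for each disk block.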
An example of the structure of a Parquet file suffering from this problem can
be seen below (RC is the row count, TS the total uncompressed byte size and
OFFSET the starting file offset of each row group). For easier interpretation,
the row groups are visually grouped by disk blocks:
{noformat}
row group 1: RC:18774 TS:22182960 OFFSET: 4
row group 2: RC: 2896 TS: 3428160 OFFSET: 6574564
row group 3: RC: 1964 TS: 2322560 OFFSET: 7679844
row group 4: RC: 1074 TS: 1268880 OFFSET: 8732964
{noformat}
{noformat}
row group 5: RC:18808 TS:22228560 OFFSET:10000000
row group 6: RC: 2872 TS: 3389520 OFFSET:16612640
row group 7: RC: 1930 TS: 2284960 OFFSET:17716800
row group 8: RC: 1040 TS: 1233520 OFFSET:18768240
{noformat}
{noformat}
row group 9: RC:18852 TS:22275520 OFFSET:20000000
row group 10: RC: 2831 TS: 3345680 OFFSET:26656320
row group 11: RC: 1893 TS: 2244640 OFFSET:27757200
row group 12: RC: 1008 TS: 1195520 OFFSET:28806560
{noformat}
{noformat}
row group 13: RC:18841 TS:22263360 OFFSET:30000000
row group 14: RC: 2835 TS: 3350480 OFFSET:36652000
row group 15: RC: 1900 TS: 2249040 OFFSET:37753600
row group 16: RC: 1016 TS: 1198640 OFFSET:38803600
{noformat}
{noformat}
row group 17: RC: 1466 TS: 1740320 OFFSET:40000000
{noformat}
In this example, both the disk block size and the row group size were set to
10,000,000 bytes. The data would fit in 5 row groups of this size, but instead,
each disk block (except the last) is split into 4 row groups of progressively
decreasing size.
was:
If there are many columns with RLE/bit-packing encodings (e.g. dictionary
encoding) and the value variance is low, the estimated size of the open pages
(which are not encoded yet) is much larger than the final page size. Because of
that, parquet-mr fails to create row groups whose size is close to
{{parquet.block.size}}, which causes performance issues while reading.
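As a made-up but representative illustration of how far off the raw estimate
can be for such a column:
{noformat}
10,000 buffered int32 values with only 4 distinct values (dictionary encoded):
  raw (unencoded) estimate:    10,000 values * 4 B        = 40,000 B
  bit-packed at 2 bits/value:  10,000 values * 2 bits / 8 =  2,500 B
=> the open page is counted at ~16x its final encoded size
{noformat}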
A hint from Ryan to solve this issue:
{quote}
We could probably get a better estimate by using the amount of buffered
data and how large other pages in a column were after fully encoding and
compressing. So if you have 5 pages compressed and buffered, and another
1000 values, use the compression ratio of the 5 pages to estimate the final
size. We'd probably want to use some overhead value for the header. And,
we'd want to separate the amount of buffered data from our row group size
estimate, which are currently the same thing.
{quote}
(So it is not only about RLE/bit-packing but about any kind of encoding that is
applied only after "closing" a page.)
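A minimal sketch of that estimate, assuming we track per column the raw and
compressed sizes of the pages closed so far (the method and parameter names are
hypothetical, not parquet-mr API):
{code:java}
// Hypothetical helper following the hint above: use the observed
// compression ratio of the already-closed pages to estimate what the open
// page will shrink to, plus an assumed per-page header overhead.
static long estimateColumnChunkSize(long closedCompressedBytes, // closed pages, after encoding + compression
                                    long closedRawBytes,        // the same pages, before encoding
                                    long openPageRawBytes,      // buffered values of the open page
                                    long pageHeaderOverhead) {  // assumed header cost of the open page
  if (closedRawBytes == 0) {
    // No finished page yet, so there is no ratio to extrapolate from;
    // fall back to the raw buffered size.
    return openPageRawBytes + pageHeaderOverhead;
  }
  double ratio = (double) closedCompressedBytes / closedRawBytes;
  return closedCompressedBytes
      + Math.round(openPageRawBytes * ratio)
      + pageHeaderOverhead;
}
{code}
The flush decision would then compare the sum of these per-column estimates
against the target row group size instead of the raw amount of buffered data,
which is exactly the separation of "buffered data" from "row group size
estimate" that the quote asks for.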
> Current block alignment logic may lead to several row groups per block
> ----------------------------------------------------------------------
>
> Key: PARQUET-1337
> URL: https://issues.apache.org/jira/browse/PARQUET-1337
> Project: Parquet
> Issue Type: Improvement
> Reporter: Gabor Szadovszky
> Assignee: Zoltan Ivanfi
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)