[ 
https://issues.apache.org/jira/browse/PARQUET-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated PARQUET-1337:
-----------------------------------
    Description: 
When the size of the buffered data gets near the desired row group size, Parquet 
flushes the data to a row group. However, at this point the data for the last 
page is not yet encoded or compressed, so the row group may end up 
significantly smaller than intended.

If the row group ends up so small that it is farther away from the next disk 
block boundary than the maximum padding, Parquet will try to create a new row 
group in the same disk block, this time targeting the remaining space. This 
group may also be flushed prematurely, leading to the creation of an even 
smaller row group, which in turn may lead to an even smaller one, and so on. 
This gets repeated until the writer is sufficiently close to the block boundary 
for padding to finally be applied. The resulting superfluous row groups can 
lead to bad read performance.

An example of the structure of a Parquet file suffering from this problem can 
be seen below. For easier interpretation, the row groups are visually grouped 
by disk blocks:
{noformat}
row group 1:  RC:18774 TS:22182960 OFFSET:       4
row group 2:  RC: 2896 TS: 3428160 OFFSET: 6574564
row group 3:  RC: 1964 TS: 2322560 OFFSET: 7679844
row group 4:  RC: 1074 TS: 1268880 OFFSET: 8732964
{noformat}
{noformat}
row group 5:  RC:18808 TS:22228560 OFFSET:10000000
row group 6:  RC: 2872 TS: 3389520 OFFSET:16612640
row group 7:  RC: 1930 TS: 2284960 OFFSET:17716800
row group 8:  RC: 1040 TS: 1233520 OFFSET:18768240
{noformat}
{noformat}
row group 9:  RC:18852 TS:22275520 OFFSET:20000000
row group 10: RC: 2831 TS: 3345680 OFFSET:26656320
row group 11: RC: 1893 TS: 2244640 OFFSET:27757200
row group 12: RC: 1008 TS: 1195520 OFFSET:28806560
{noformat}
{noformat}
row group 13: RC:18841 TS:22263360 OFFSET:30000000
row group 14: RC: 2835 TS: 3350480 OFFSET:36652000
row group 15: RC: 1900 TS: 2249040 OFFSET:37753600
row group 16: RC: 1016 TS: 1198640 OFFSET:38803600
{noformat}
{noformat}
row group 17: RC: 1466 TS: 1740320 OFFSET:40000000
{noformat}
In this example, both the disk block size and the row group size were set to 
10000000. The data would fit in 5 row groups of this size, but instead, each of 
the disk blocks (except the last) is split into 4 row groups of progressively 
decreasing size.
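The cascade described above can be sketched with a toy model. This is not 
parquet-mr code; the padding limit, the 35% shrink factor, and all names below 
are illustrative assumptions chosen to mimic the file layout shown above:

{noformat}
# Toy model of the current alignment logic cascading into several
# row groups per disk block. All constants are illustrative assumptions.

BLOCK_SIZE = 10_000_000      # disk block size
ROW_GROUP_SIZE = 10_000_000  # desired row group size
MAX_PADDING = 800_000        # hypothetical padding limit

def flushed_size(target):
    # The writer compares the *buffered* (unencoded, uncompressed) size
    # against the target, so the flushed group comes out smaller than
    # intended; assume encoding/compression shrinks it by ~35%.
    return int(target * 0.65)

def simulate_block(offset=0):
    """Return (offset, size) pairs of the row groups written to one block."""
    groups = []
    while True:
        remaining = BLOCK_SIZE - offset
        if remaining <= MAX_PADDING:
            break  # close enough to the boundary: pad and stop
        size = flushed_size(min(ROW_GROUP_SIZE, remaining))
        groups.append((offset, size))
        offset += size
    return groups

for off, size in simulate_block():
    print(f"row group at offset {off:>8}, size {size:>8}")
{noformat}

With these assumed numbers the model emits several progressively smaller row 
groups per block before the gap finally fits under the padding limit, matching 
the pattern in the dump above.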

  was:
When the size of the buffered data gets near the desired row group size, Parquet 
flushes the data to a row group. However, at this point the data for the last 
page is not yet encoded or compressed, so the row group may end up 
significantly smaller than intended.

If the row group ends up so small that it is farther away from the next disk 
block boundary than the maximum padding, Parquet will try to create a new row 
group in the same disk block, this time targeting the remaining space. This 
group may also be flushed prematurely, leading to the creation of an even 
smaller row group, which in turn may lead to an even smaller one, and so on. 
This gets repeated until the writer is sufficiently close to the block boundary 
for padding to finally be applied. The resulting superfluous row groups can 
lead to bad performance.

An example of the structure of a Parquet file suffering from this problem can 
be seen below. For easier interpretation, the row groups are visually grouped 
by disk blocks:

{noformat}
row group 1:  RC:18774 TS:22182960 OFFSET:       4
row group 2:  RC: 2896 TS: 3428160 OFFSET: 6574564
row group 3:  RC: 1964 TS: 2322560 OFFSET: 7679844
row group 4:  RC: 1074 TS: 1268880 OFFSET: 8732964
{noformat}
{noformat}
row group 5:  RC:18808 TS:22228560 OFFSET:10000000
row group 6:  RC: 2872 TS: 3389520 OFFSET:16612640
row group 7:  RC: 1930 TS: 2284960 OFFSET:17716800
row group 8:  RC: 1040 TS: 1233520 OFFSET:18768240
{noformat}
{noformat}
row group 9:  RC:18852 TS:22275520 OFFSET:20000000
row group 10: RC: 2831 TS: 3345680 OFFSET:26656320
row group 11: RC: 1893 TS: 2244640 OFFSET:27757200
row group 12: RC: 1008 TS: 1195520 OFFSET:28806560
{noformat}
{noformat}
row group 13: RC:18841 TS:22263360 OFFSET:30000000
row group 14: RC: 2835 TS: 3350480 OFFSET:36652000
row group 15: RC: 1900 TS: 2249040 OFFSET:37753600
row group 16: RC: 1016 TS: 1198640 OFFSET:38803600
{noformat}
{noformat}
row group 17: RC: 1466 TS: 1740320 OFFSET:40000000
{noformat}

In this example, both the disk block size and the row group size were set to 
10000000. The data would fit in 5 row groups of this size, but instead, each of 
the disk blocks (except the last) is split into 4 row groups of progressively 
decreasing size.


> Current block alignment logic may lead to several row groups per block
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-1337
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1337
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Gabor Szadovszky
>            Assignee: Zoltan Ivanfi
>            Priority: Major
>              Labels: pull-request-available



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
