[ 
https://issues.apache.org/jira/browse/PARQUET-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated PARQUET-1337:
-----------------------------------
    Component/s: parquet-mr

> Current block alignment logic may lead to several row groups per block
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-1337
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1337
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Gabor Szadovszky
>            Assignee: Zoltan Ivanfi
>            Priority: Major
>              Labels: pull-request-available
>
> When the size of the buffered data gets near the desired row group size, Parquet 
> flushes the data to a row group. However, at this point the data for the last 
> page is not yet encoded or compressed, so the row group may end up 
> significantly smaller than intended.
> If the row group ends up so small that it is farther away from the next 
> disk block boundary than the maximum padding, Parquet will try to create a 
> new row group in the same disk block, this time targeting the remaining space. 
> That row group may also be flushed prematurely, producing an even smaller 
> row group, which in turn may lead to an even smaller one, and so on. This 
> repeats until the write position gets close enough to the block boundary for 
> padding to finally be applied. The resulting superfluous row groups can lead 
> to bad performance.
> An example of the structure of a Parquet file suffering from this problem can 
> be seen below. For easier interpretation, the row groups are visually grouped 
> by disk blocks:
> {noformat}
> row group 1:  RC:18774 TS:22182960 OFFSET:       4
> row group 2:  RC: 2896 TS: 3428160 OFFSET: 6574564
> row group 3:  RC: 1964 TS: 2322560 OFFSET: 7679844
> row group 4:  RC: 1074 TS: 1268880 OFFSET: 8732964
> {noformat}
> {noformat}
> row group 5:  RC:18808 TS:22228560 OFFSET:10000000
> row group 6:  RC: 2872 TS: 3389520 OFFSET:16612640
> row group 7:  RC: 1930 TS: 2284960 OFFSET:17716800
> row group 8:  RC: 1040 TS: 1233520 OFFSET:18768240
> {noformat}
> {noformat}
> row group 9:  RC:18852 TS:22275520 OFFSET:20000000
> row group 10: RC: 2831 TS: 3345680 OFFSET:26656320
> row group 11: RC: 1893 TS: 2244640 OFFSET:27757200
> row group 12: RC: 1008 TS: 1195520 OFFSET:28806560
> {noformat}
> {noformat}
> row group 13: RC:18841 TS:22263360 OFFSET:30000000
> row group 14: RC: 2835 TS: 3350480 OFFSET:36652000
> row group 15: RC: 1900 TS: 2249040 OFFSET:37753600
> row group 16: RC: 1016 TS: 1198640 OFFSET:38803600
> {noformat}
> {noformat}
> row group 17: RC: 1466 TS: 1740320 OFFSET:40000000
> {noformat}
> In this example, both the disk block size and the row group size were set to 
> 10000000. The data would fit into 5 row groups of this size, but instead, each 
> disk block (except the last) is split into 4 row groups of 
> progressively decreasing size.
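The cascade described in the issue can be sketched with a small simulation. This is not parquet-mr code; the flushed-to-target ratio and the maximum padding value below are illustrative assumptions, chosen only to show how retargeting the remaining block space after each premature flush produces progressively smaller row groups.

```python
# Hypothetical sketch of the degenerate alignment loop described above
# (illustrative constants, not parquet-mr defaults).

BLOCK_SIZE = 10_000_000   # disk block size, as in the example listing
MAX_PADDING = 500_000     # assumed maximum padding Parquet may insert
FLUSH_RATIO = 0.65        # assumed: flushed size / targeted size, since the
                          # last page is not yet encoded or compressed when
                          # the flush decision is made

def row_groups_in_block(block_size, max_padding, ratio):
    """Return the flushed sizes of the row groups packed into one disk block."""
    sizes = []
    remaining = block_size
    while remaining > max_padding:
        # Each new row group targets the space left in the block, but the
        # data actually written comes out smaller than the target.
        flushed = int(remaining * ratio)
        sizes.append(flushed)
        remaining -= flushed
    # The final `remaining` bytes are close enough to the boundary to be padded.
    return sizes

print(row_groups_in_block(BLOCK_SIZE, MAX_PADDING, FLUSH_RATIO))
```

With these assumed numbers the block is split into several row groups of strictly decreasing size, mirroring the 4-groups-per-block pattern in the file listing above; only the first row group is anywhere near the intended size.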



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
