[
https://issues.apache.org/jira/browse/PARQUET-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated PARQUET-1337:
------------------------------------
Labels: pull-request-available (was: )
> Current block alignment logic may lead to several row groups per block
> ----------------------------------------------------------------------
>
> Key: PARQUET-1337
> URL: https://issues.apache.org/jira/browse/PARQUET-1337
> Project: Parquet
> Issue Type: Improvement
> Reporter: Gabor Szadovszky
> Assignee: Zoltan Ivanfi
> Priority: Major
> Labels: pull-request-available
>
> When the size of buffered data gets near the desired row group size, Parquet
> flushes the data to a row group. However, at this point the data for the last
> page is neither encoded nor compressed yet, so the row group may end up
> being significantly smaller than intended.
> If the row group ends up being so small that it is farther away from the next
> disk block boundary than the maximum padding, Parquet will try to create a
> new row group in the same disk block, this time targeting the remaining space.
> This may also be flushed prematurely, leading to the creation of an even
> smaller row group, which may lead to an even smaller one... This gets
> repeated until we get sufficiently close to the block boundary so that
> padding can be finally applied. The resulting superfluous row groups can lead
> to bad performance.
> An example of the structure of a Parquet file suffering from this problem can
> be seen below. For easier interpretation, the row groups are visually grouped
> by disk blocks:
> {noformat}
> row group 1: RC:18774 TS:22182960 OFFSET: 4
> row group 2: RC: 2896 TS: 3428160 OFFSET: 6574564
> row group 3: RC: 1964 TS: 2322560 OFFSET: 7679844
> row group 4: RC: 1074 TS: 1268880 OFFSET: 8732964
> {noformat}
> {noformat}
> row group 5: RC:18808 TS:22228560 OFFSET:10000000
> row group 6: RC: 2872 TS: 3389520 OFFSET:16612640
> row group 7: RC: 1930 TS: 2284960 OFFSET:17716800
> row group 8: RC: 1040 TS: 1233520 OFFSET:18768240
> {noformat}
> {noformat}
> row group 9: RC:18852 TS:22275520 OFFSET:20000000
> row group 10: RC: 2831 TS: 3345680 OFFSET:26656320
> row group 11: RC: 1893 TS: 2244640 OFFSET:27757200
> row group 12: RC: 1008 TS: 1195520 OFFSET:28806560
> {noformat}
> {noformat}
> row group 13: RC:18841 TS:22263360 OFFSET:30000000
> row group 14: RC: 2835 TS: 3350480 OFFSET:36652000
> row group 15: RC: 1900 TS: 2249040 OFFSET:37753600
> row group 16: RC: 1016 TS: 1198640 OFFSET:38803600
> {noformat}
> {noformat}
> row group 17: RC: 1466 TS: 1740320 OFFSET:40000000
> {noformat}
> In this example, both the disk block size and the row group size were set to
> 10000000. The data would fit in 5 row groups of this size, but instead, each
> of the disk blocks (except the last) is split into 4 row groups of
> progressively decreasing size.
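
The cascade described above can be reproduced with a small model. This is only
a sketch and not the actual Parquet writer code: the function name
simulate_block, the max_padding value and the shrink ratio (how much smaller a
row group ends up on disk than the size it was targeted at) are all
assumptions made for illustration.

```python
def simulate_block(block_size, row_group_size, max_padding, shrink):
    """Return the on-disk sizes of the row groups written into one disk block.

    shrink is a hypothetical ratio: a row group flushed at a target of T bytes
    is assumed to occupy only int(T * shrink) bytes on disk.
    """
    sizes = []
    pos = 0  # current write position inside the disk block
    while True:
        remaining = block_size - pos
        if remaining <= max_padding:
            # Close enough to the boundary: the rest of the block is padded.
            break
        # The next row group targets whatever space is left in the block.
        target = min(row_group_size, remaining)
        written = int(target * shrink)
        if written == 0:
            # Guard against a degenerate model that makes no progress.
            break
        sizes.append(written)
        pos += written
    return sizes


# Premature flushes (shrink < 1) split the block into shrinking row groups.
print(simulate_block(10_000_000, 10_000_000, 1_000_000, 0.5))
# -> [5000000, 2500000, 1250000, 625000]

# If the flushed size matched the target, one row group would fill the block.
print(simulate_block(10_000_000, 10_000_000, 1_000_000, 1.0))
# -> [10000000]
```

With these assumed values the model yields four row groups of progressively
decreasing size per disk block, the same pattern as in the listing above,
although the real sizes there do not follow a constant ratio.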
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)