[ https://issues.apache.org/jira/browse/PARQUET-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zoltan Ivanfi updated PARQUET-1337:
-----------------------------------
    Component/s: parquet-mr

> Current block alignment logic may lead to several row groups per block
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-1337
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1337
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Gabor Szadovszky
>            Assignee: Zoltan Ivanfi
>            Priority: Major
>              Labels: pull-request-available
>
> When the size of the buffered data gets close to the desired row group size,
> Parquet flushes the data to a row group. However, at this point the data for
> the last page has not yet been encoded or compressed, so the row group may
> end up significantly smaller than intended.
> If the row group ends up so small that its end is farther from the next disk
> block boundary than the maximum padding, Parquet will try to create a new row
> group in the same disk block, this time targeting the remaining space. This
> one may also be flushed prematurely, leading to an even smaller row group,
> which may lead to an even smaller one, and so on. This repeats until we get
> close enough to the block boundary for padding to finally be applied. The
> resulting superfluous row groups can lead to bad performance.
> An example of the structure of a Parquet file suffering from this problem can
> be seen below. For easier interpretation, the row groups are visually grouped
> by disk blocks:
> {noformat}
> row group  1: RC:18774 TS:22182960 OFFSET:       4
> row group  2: RC: 2896 TS: 3428160 OFFSET: 6574564
> row group  3: RC: 1964 TS: 2322560 OFFSET: 7679844
> row group  4: RC: 1074 TS: 1268880 OFFSET: 8732964
> {noformat}
> {noformat}
> row group  5: RC:18808 TS:22228560 OFFSET:10000000
> row group  6: RC: 2872 TS: 3389520 OFFSET:16612640
> row group  7: RC: 1930 TS: 2284960 OFFSET:17716800
> row group  8: RC: 1040 TS: 1233520 OFFSET:18768240
> {noformat}
> {noformat}
> row group  9: RC:18852 TS:22275520 OFFSET:20000000
> row group 10: RC: 2831 TS: 3345680 OFFSET:26656320
> row group 11: RC: 1893 TS: 2244640 OFFSET:27757200
> row group 12: RC: 1008 TS: 1195520 OFFSET:28806560
> {noformat}
> {noformat}
> row group 13: RC:18841 TS:22263360 OFFSET:30000000
> row group 14: RC: 2835 TS: 3350480 OFFSET:36652000
> row group 15: RC: 1900 TS: 2249040 OFFSET:37753600
> row group 16: RC: 1016 TS: 1198640 OFFSET:38803600
> {noformat}
> {noformat}
> row group 17: RC: 1466 TS: 1740320 OFFSET:40000000
> {noformat}
> In this example, both the disk block size and the row group size were set to
> 10000000. The data would fit in 5 row groups of this size, but instead, each
> of the disk blocks (except the last) is split into 4 row groups of
> progressively decreasing size.
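The cascade described in the issue can be reproduced with a small toy model. The sketch below is not parquet-mr code: the function name `simulate_row_groups`, the `fill_ratio` parameter (the assumed fraction of the targeted size that a prematurely flushed row group actually occupies on disk) and the `max_padding` value of 1000000 are all hypothetical, chosen only to illustrate the mechanism. The block size and row group size of 10000000 are taken from the example in the issue.

```python
def simulate_row_groups(block_size, row_group_size, max_padding, fill_ratio, num_blocks):
    """Toy model: return, per disk block, a list of (offset, target, written) tuples."""
    blocks = [[] for _ in range(num_blocks)]
    pos = 0
    end = num_blocks * block_size
    while pos < end:
        # Space left before the next disk block boundary.
        remaining = block_size - pos % block_size
        if remaining <= max_padding:
            # Close enough to the boundary: pad and start at the next block.
            pos += remaining
            continue
        # Too far away to pad, so target whatever is left of the current block.
        target = min(row_group_size, remaining)
        # Assumption: the flushed row group ends up smaller than its target.
        written = max(1, int(target * fill_ratio))
        blocks[pos // block_size].append((pos, target, written))
        pos += written
    return blocks


# Block size and row group size from the issue; the other values are hypothetical.
layout = simulate_row_groups(
    block_size=10_000_000,
    row_group_size=10_000_000,
    max_padding=1_000_000,
    fill_ratio=0.5,
    num_blocks=2,
)

for index, groups in enumerate(layout):
    for offset, target, written in groups:
        print(f"block {index}: offset={offset:>9} target={target:>9} written={written:>9}")
```

With these assumed values, every block in the model holds four row groups with targets of 10000000, 5000000, 2500000 and 1250000 bytes, after which the remaining 625000 bytes fall within the padding limit and are padded. Setting `fill_ratio` to 1.0 in the same model yields a single row group per block, which is the layout the alignment logic is meant to produce.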