Hi All,

One of our customers faced the following issue. parquet.block.size is
configured to 128M (parquet.writer.max-padding is left at its default of
8M). On average 7 row-groups are generated per block, with sizes of roughly
~74M, ~16M, ~12M, ~9M, ~7M, ~5M and ~4M. By increasing the padding to e.g.
60M only one row-group per block is written, but that wastes disk space.
Looking at the logs, it turns out that parquet-mr thinks the row-group is
already close to 128M, so it writes the first one, then realizes there is
still space left before reaching the block size, and so on:
INFO hadoop.InternalParquetRecordWriter: mem size 134,673,545 > 134,217,728: flushing 484,972 records to disk.
INFO hadoop.InternalParquetRecordWriter: mem size 59,814,120 > 59,814,925: flushing 99,030 records to disk.
INFO hadoop.InternalParquetRecordWriter: mem size 43,396,192 > 43,397,248: flushing 71,848 records to disk.
...
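
For reference, this is roughly how the properties involved are set on the
writer side (just a minimal sketch using the Hadoop Configuration keys
mentioned above; the 60M value is only the experiment described earlier,
not a recommendation):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.setLong("parquet.block.size", 128L * 1024 * 1024);      // target row-group size
conf.setInt("parquet.writer.max-padding", 8 * 1024 * 1024);  // default padding limit
// Raising the padding limit to ~60M yields one row-group per block,
// but can pad (waste) up to 60M in every block:
// conf.setInt("parquet.writer.max-padding", 60 * 1024 * 1024);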

My idea about the root cause is that there are many dictionary-encoded
columns with a low number of distinct values. When we approximate the
row-group size, some pages are still open (not encoded yet). For these
pages we count 4 bytes per value for the dictionary indexes. But if the
number of distinct values is low, the RLE/bit-packing encoding will shrink
these pages dramatically once they are actually written.
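
To illustrate the size gap with some back-of-the-envelope arithmetic (this
is not the actual parquet-mr estimation code, just a sketch assuming the
4-bytes-per-index approximation described above; the names are made up):

// Rough illustration of the estimation gap for one open page.
public class DictEstimateGap {
  public static void main(String[] args) {
    long values = 100_000;   // dictionary indexes in the page
    long distinct = 4;       // distinct values in the column chunk

    // Open-page approximation: 4 bytes per dictionary index.
    long estimatedBytes = values * 4;                          // 400,000 B

    // Plain bit-packing needs only ceil(log2(distinct)) bits per value.
    int bitWidth = 64 - Long.numberOfLeadingZeros(distinct - 1); // 2 bits
    long bitPackedBytes = (values * bitWidth + 7) / 8;           // 25,000 B

    System.out.println("estimated:           " + estimatedBytes + " B");
    System.out.println("bit-packed (no RLE): " + bitPackedBytes + " B");
    // RLE on long runs of repeated values can shrink this even further,
    // so the real encoded page can be an order of magnitude (or more)
    // smaller than what the open-page estimate assumes.
  }
}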

What do you guys think? Are we able to make the approximation a bit better?
Do we have some properties that can solve this issue?

Thanks a lot,
Gabor
