Thanks a lot, Ryan. Created the JIRA PARQUET-1337 to track it.

On Sat, Jun 23, 2018 at 1:29 AM Ryan Blue <[email protected]> wrote:
> I think you're right about the cause. The current estimate is what is
> buffered in memory, so it includes all of the intermediate data for the
> last page before it is finalized and compressed.
>
> We could probably get a better estimate by using the amount of buffered
> data and how large other pages in a column were after fully encoding and
> compressing. So if you have 5 pages compressed and buffered, and another
> 1000 values, use the compression ratio of the 5 pages to estimate the
> final size. We'd probably want to use some overhead value for the header.
> And we'd want to separate the amount of buffered data from our row group
> size estimate, which are currently the same thing.
>
> rb
>
> On Thu, Jun 21, 2018 at 1:17 AM Gabor Szadovszky <[email protected]> wrote:
>
> > Hi All,
> >
> > One of our customers faced the following issue. parquet.block.size is
> > configured to 128M. (parquet.writer.max-padding is left with the
> > default 8M.) On average, 7 row-groups are generated in one block with
> > the sizes ~74M, ~16M, ~12M, ~9M, ~7M, ~5M, ~4M. By increasing the
> > padding to e.g. 60M, only one row-group per block is written, but it is
> > a waste of disk space.
> > Investigating the logs, it turns out that parquet-mr thinks the
> > row-group is already close to 128M, so it writes the first one, then
> > realizes there is still space to write until reaching the block size,
> > and so on:
> > INFO hadoop.InternalParquetRecordWriter: mem size 134,673,545 > 134,217,728: flushing 484,972 records to disk.
> > INFO hadoop.InternalParquetRecordWriter: mem size 59,814,120 > 59,814,925: flushing 99,030 records to disk.
> > INFO hadoop.InternalParquetRecordWriter: mem size 43,396,192 > 43,397,248: flushing 71,848 records to disk.
> > ...
> >
> > My idea about the root cause is that there are many dictionary-encoded
> > columns where the value variance is low. When we approximate the
> > row-group size, there are pages which are still open (not encoded yet).
> > If these pages are dictionary encoded, we calculate with 4-byte values
> > as the dictionary indexes. But if the variance is low, RLE and bit
> > packing will decrease the size of these pages dramatically.
> >
> > What do you guys think? Are we able to make the approximation a bit
> > better? Do we have some properties that can solve this issue?
> >
> > Thanks a lot,
> > Gabor
>
> --
> Ryan Blue
> Software Engineer
> Netflix
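
To make the overestimate Gabor describes concrete, here is a back-of-the-envelope
sketch (standalone Java, not parquet-mr code; the value count and cardinality are
made up) comparing the 4-bytes-per-index assumption for a still-open dictionary
page with a rough bit-packed upper bound:

// Illustration only: why an open dictionary page can be overestimated. The
// writer counts 4 bytes per dictionary index, while bit-packing needs only
// ceil(log2(cardinality)) bits per value, and RLE runs shrink it further.
public class DictionaryPageEstimate {

    public static void main(String[] args) {
        long bufferedValues = 1_000_000L;   // values sitting in the still-open page(s)
        int distinctValues = 10;            // low-variance column

        // What the writer assumes for a not-yet-encoded dictionary page: 4 bytes per index.
        long rawEstimateBytes = bufferedValues * 4;

        // Rough upper bound after bit-packing alone (RLE runs would shrink it further).
        int bitWidth = 32 - Integer.numberOfLeadingZeros(Math.max(1, distinctValues - 1));
        long bitPackedBytes = (bufferedValues * bitWidth + 7) / 8;

        System.out.printf("raw estimate:    %,d bytes%n", rawEstimateBytes);
        System.out.printf("bit-packed (<=): %,d bytes (bit width %d)%n", bitPackedBytes, bitWidth);
    }
}

With 1,000,000 buffered values over 10 distinct values that is roughly 4 MB
estimated versus at most ~0.5 MB once bit-packed, which is the kind of gap that
can produce the undersized trailing row-groups in the log lines above.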

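And a minimal sketch of the estimate Ryan suggests, assuming hypothetical names
(estimateColumnChunkSize, PAGE_HEADER_OVERHEAD_BYTES) and a made-up header
allowance rather than the actual parquet-mr writer code: project the still-buffered
raw bytes through the compression ratio observed on the pages already finished for
that column, and add a per-page header overhead.

// Sketch only, not parquet-mr: estimate a column chunk's final size from the
// compression ratio of its already-finished pages plus a header allowance.
public class RowGroupSizeEstimator {

    // Hypothetical per-page header allowance; a real implementation would measure it.
    private static final long PAGE_HEADER_OVERHEAD_BYTES = 64;

    // finishedPagesCompressedBytes: pages of this column already encoded and compressed
    // finishedPagesRawBytes:        raw in-memory size those same pages held before encoding
    // openPageRawBytes:             raw data currently buffered in the open page
    // finishedPageCount:            number of finished pages (for header overhead)
    static long estimateColumnChunkSize(long finishedPagesCompressedBytes,
                                        long finishedPagesRawBytes,
                                        long openPageRawBytes,
                                        int finishedPageCount) {
        // No finished page yet: fall back to the pessimistic raw size.
        double ratio = finishedPagesRawBytes > 0
                ? (double) finishedPagesCompressedBytes / finishedPagesRawBytes
                : 1.0;
        long projectedOpenPage = (long) (openPageRawBytes * ratio);
        return finishedPagesCompressedBytes
                + projectedOpenPage
                + (finishedPageCount + 1L) * PAGE_HEADER_OVERHEAD_BYTES;
    }

    public static void main(String[] args) {
        // Example: 5 finished pages compressed 40 MB raw down to 6 MB; 8 MB still buffered.
        long estimate = estimateColumnChunkSize(6L << 20, 40L << 20, 8L << 20, 5);
        System.out.printf("estimated column chunk size: %,d bytes%n", estimate);
    }
}

Summing something like this per column chunk, instead of the raw buffered size,
would keep the row-group size check closer to what actually lands on disk; it is
only pessimistic before the first page of a column is finished, hence the fallback.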