Thanks a lot, Ryan. Created the JIRA PARQUET-1337 to track it.

On Sat, Jun 23, 2018 at 1:29 AM Ryan Blue <[email protected]> wrote:
> I think you're right about the cause. The current estimate is what is
> buffered in memory, so it includes all of the intermediate data for the
> last page before it is finalized and compressed.
>
> We could probably get a better estimate by using the amount of buffered
> data and how large other pages in a column were after fully encoding and
> compressing. So if you have 5 pages compressed and buffered, and another
> 1000 values, use the compression ratio of the 5 pages to estimate the
> final size. We'd probably want to use some overhead value for the header.
> And we'd want to separate the amount of buffered data from our row group
> size estimate, which are currently the same thing.
>
> rb
>
> On Thu, Jun 21, 2018 at 1:17 AM Gabor Szadovszky <[email protected]> wrote:
>
> > Hi All,
> >
> > One of our customers faced the following issue. parquet.block.size is
> > configured to 128M. (parquet.writer.max-padding is left with the
> > default 8M.) On average, 7 row-groups are generated in one block with
> > the sizes ~74M, ~16M, ~12M, ~9M, ~7M, ~5M, ~4M. By increasing the
> > padding to e.g. 60M, only one row-group per block is written, but it is
> > a waste of disk space.
> > Investigating the logs, it turns out that parquet-mr thinks the
> > row-group is already close to 128M, so it writes the first one, then
> > realizes there is still space to write until reaching the block size,
> > and so on:
> > INFO hadoop.InternalParquetRecordWriter: mem size 134,673,545 > 134,217,728: flushing 484,972 records to disk.
> > INFO hadoop.InternalParquetRecordWriter: mem size 59,814,120 > 59,814,925: flushing 99,030 records to disk.
> > INFO hadoop.InternalParquetRecordWriter: mem size 43,396,192 > 43,397,248: flushing 71,848 records to disk.
> > ...
> >
> > My idea about the root cause is that there are many dictionary-encoded
> > columns where the value variance is low. When we approximate the
> > row-group size, there are pages which are still open (not encoded yet).
> > If these pages are dictionary encoded, we calculate with 4-byte values
> > as the dictionary indexes. But if the variance is low, RLE and bit
> > packing will decrease the size of these pages dramatically.
> >
> > What do you guys think? Are we able to make the approximation a bit
> > better? Do we have some properties that can solve this issue?
> >
> > Thanks a lot,
> > Gabor
>
> --
> Ryan Blue
> Software Engineer
> Netflix
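
To make the overestimate Gabor describes concrete, here is a back-of-the-envelope
sketch (standalone Java, not parquet-mr code; the value count and cardinality are
made up) comparing the 4-bytes-per-index assumption for a still-open dictionary
page with a rough bit-packed upper bound:

// Illustration only: why an open dictionary page can be overestimated. The
// writer counts 4 bytes per dictionary index, while bit-packing needs only
// ceil(log2(cardinality)) bits per value, and RLE runs shrink it further.
public class DictionaryPageEstimate {

    public static void main(String[] args) {
        long bufferedValues = 1_000_000L;   // values sitting in the still-open page(s)
        int distinctValues = 10;            // low-variance column

        // What the writer assumes for a not-yet-encoded dictionary page: 4 bytes per index.
        long rawEstimateBytes = bufferedValues * 4;

        // Rough upper bound after bit-packing alone (RLE runs would shrink it further).
        int bitWidth = 32 - Integer.numberOfLeadingZeros(Math.max(1, distinctValues - 1));
        long bitPackedBytes = (bufferedValues * bitWidth + 7) / 8;

        System.out.printf("raw estimate:    %,d bytes%n", rawEstimateBytes);
        System.out.printf("bit-packed (<=): %,d bytes (bit width %d)%n", bitPackedBytes, bitWidth);
    }
}

With 1,000,000 buffered values over 10 distinct values that is roughly 4 MB
estimated versus at most ~0.5 MB once bit-packed, which is the kind of gap that
can produce the undersized trailing row-groups in the log lines above.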

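And a minimal sketch of the estimate Ryan suggests, assuming hypothetical names
(estimateColumnChunkSize, PAGE_HEADER_OVERHEAD_BYTES) and a made-up header
allowance rather than the actual parquet-mr writer code: project the still-buffered
raw bytes through the compression ratio observed on the pages already finished for
that column, and add a per-page header overhead.

// Sketch only, not parquet-mr: estimate a column chunk's final size from the
// compression ratio of its already-finished pages plus a header allowance.
public class RowGroupSizeEstimator {

    // Hypothetical per-page header allowance; a real implementation would measure it.
    private static final long PAGE_HEADER_OVERHEAD_BYTES = 64;

    // finishedPagesCompressedBytes: pages of this column already encoded and compressed
    // finishedPagesRawBytes:        raw in-memory size those same pages held before encoding
    // openPageRawBytes:             raw data currently buffered in the open page
    // finishedPageCount:            number of finished pages (for header overhead)
    static long estimateColumnChunkSize(long finishedPagesCompressedBytes,
                                        long finishedPagesRawBytes,
                                        long openPageRawBytes,
                                        int finishedPageCount) {
        // No finished page yet: fall back to the pessimistic raw size.
        double ratio = finishedPagesRawBytes > 0
                ? (double) finishedPagesCompressedBytes / finishedPagesRawBytes
                : 1.0;
        long projectedOpenPage = (long) (openPageRawBytes * ratio);
        return finishedPagesCompressedBytes
                + projectedOpenPage
                + (finishedPageCount + 1L) * PAGE_HEADER_OVERHEAD_BYTES;
    }

    public static void main(String[] args) {
        // Example: 5 finished pages compressed 40 MB raw down to 6 MB; 8 MB still buffered.
        long estimate = estimateColumnChunkSize(6L << 20, 40L << 20, 8L << 20, 5);
        System.out.printf("estimated column chunk size: %,d bytes%n", estimate);
    }
}

Summing something like this per column chunk, instead of the raw buffered size,
would keep the row-group size check closer to what actually lands on disk; it is
only pessimistic before the first page of a column is finished, hence the fallback.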