Hi Kirill,

It's hard to say what the expected compression rate should be since that's heavily data-dependent. Sounds like Parquet isn't doing too badly, though.
For inspecting the files, check out parquet-tools [1]. That can dump the metadata from a file all the way down to the page level. The "meta" command will print out each row group and column within those row groups, which should give you the info you're looking for. (There's also a rough programmatic sketch below the quoted message, in case a small program is easier to run over many files.)

rb

[1]: http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar

On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <[email protected]> wrote:

> Hi guys,
>
> We're evaluating Parquet as the high-compression format for our logs. We
> took some ~850 GB of TSV data (some columns are JSON) and Parquet
> (CompressionCodec.GZIP) gave us 6.8x compression, whereas plain GZip (with
> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the same data.
>
> So the questions are:
>
> 1) Is this a somewhat expected compression rate (compared to GZip)?
> 2) As we specially crafted the Parquet schema with maps and lists for
> certain fields, is there any tool to show the sizes of individual Parquet
> columns so we can find the biggest ones?
>
> Thanks in advance,
> Kirill

--
Ryan Blue
Software Engineer
Netflix
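P.S. If a program is more convenient than the CLI, here's a rough sketch of reading the same per-column size numbers straight from the file footer with the parquet-hadoop API (readFooter and the metadata getters shown are the 1.8.x-era calls; the ColumnSizes class name and the single-file argument are just for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class ColumnSizes {
  public static void main(String[] args) throws Exception {
    // First argument: path to a single Parquet file (illustrative)
    Path file = new Path(args[0]);

    // Read only the footer; no row data is decompressed
    ParquetMetadata footer =
        ParquetFileReader.readFooter(new Configuration(), file);

    // Each row group (block) has one column chunk per leaf column
    for (BlockMetaData block : footer.getBlocks()) {
      for (ColumnChunkMetaData column : block.getColumns()) {
        System.out.printf("%s: compressed=%d bytes, uncompressed=%d bytes%n",
            column.getPath(), column.getTotalSize(),
            column.getTotalUncompressedSize());
      }
    }
  }
}

Summing the compressed sizes per column path across row groups (and across files) should surface the biggest columns quickly.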
