Thanks for the hint, Ryan!
I applied the tool to the file and I’ve got some more questions if you don’t
mind :-)
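(For reference, I ran the “meta” command roughly like this; the file name is
just a placeholder:)

  java -jar parquet-tools-1.8.1.jar meta our-logs.parquet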
1) We’re using a 64 MB row group size, so I would expect the sum of all the
values in the “compressed size” field (the {x} in the SZ:{x}/{y}/{z} notation)
to be around 64 MB, but it’s closer to 48 MB. Is this expected?
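(For context, we configure the writer roughly like this; a simplified sketch
against the parquet-mr 1.8.x Hadoop API, not our exact code:)

  import org.apache.hadoop.mapreduce.Job;
  import org.apache.parquet.hadoop.ParquetOutputFormat;
  import org.apache.parquet.hadoop.metadata.CompressionCodecName;

  Job job = Job.getInstance();
  // Target row group ("block") size: 64 MB.
  ParquetOutputFormat.setBlockSize(job, 64 * 1024 * 1024);
  ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);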
2) One of the largest fields is a Unix timestamp (we may have lots of
timestamps for a single data record), which is written as a plain int64 (we
refrained from using OriginalType.TIMESTAMP_MILLIS as it does not yet seem to
be supported by Spark). The tool says this column is stored with “ENC:PLAIN”
encoding (which I suppose is GZipped afterwards). Is this the most compact way
to store timestamps, or would e.g. an OriginalType.TIMESTAMP_MILLIS annotation
or some other hint make an improvement?
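(For reference, the field is declared roughly as below; a simplified sketch,
and the field name is made up:)

  import org.apache.parquet.schema.MessageType;
  import org.apache.parquet.schema.MessageTypeParser;

  // Current schema: plain int64, no logical type annotation.
  MessageType current = MessageTypeParser.parseMessageType(
      "message record { repeated int64 event_ts; }");

  // The annotated variant we are asking about:
  MessageType annotated = MessageTypeParser.parseMessageType(
      "message record { repeated int64 event_ts (TIMESTAMP_MILLIS); }");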
Thanks,
Kirill
> On 07 Mar 2016, at 00:26, Ryan Blue <[email protected]> wrote:
>
> Hi Kirill,
>
> It's hard to say what the expected compression rate should be since that's
> heavily data-dependent. Sounds like Parquet isn't doing too bad, though.
>
> For inspecting the files, check out parquet-tools [1]. That can dump the
> metadata from a file all the way down to the page level. The "meta" command
> will print out each row group and column within those row groups, which
> should give you the info you're looking for.
>
> rb
>
>
> [1]:
> http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar
>
> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <[email protected]>
> wrote:
>
>> Hi guys,
>>
>> We’re evaluating Parquet as a high-compression format for our logs. We
>> took ~850 GB of TSV data (some columns are JSON), and Parquet
>> (CompressionCodec.GZIP) gave us 6.8x compression, whereas plain GZip (with
>> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times lower) on the same data.
>>
>> So the questions are:
>>
>> 1) Is this roughly the expected compression ratio (compared to GZip)?
>> 2) As we specially crafted the Parquet schema with maps and lists for
>> certain fields, is there any tool that shows the sizes of individual
>> Parquet columns, so we can find the biggest ones?
>>
>> Thanks in advance,
>> Kirill
>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix