Thanks for the hint Ryan!

I applied the tool to the file and I’ve got some more questions if you don’t 
mind :-)

1) We’re using a 64 MB page (row group) size, so I would expect the sum of all 
the values in the “compressed size” field (the {x} in the SZ:{x}/{y}/{z} 
notation) to be around 64 MB, but it’s closer to 48 MB. Is this expected?
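In case it helps, this is roughly how I’m tallying the compressed totals from 
the tool’s output (a quick Python sketch; the sample lines and column names 
are made up for illustration, only the SZ:{x}/{y}/{z} token format is taken 
from the meta dump):

```python
import re

def sum_compressed(meta_text):
    """Sum the compressed-size component ({x}) of every SZ:{x}/{y}/{z}
    token found in `parquet-tools meta` output."""
    return sum(int(m.group(1))
               for m in re.finditer(r"SZ:(\d+)/(\d+)", meta_text))

# Made-up sample lines in the SZ:{x}/{y}/{z} shape the tool prints:
sample = ("col_a: INT64 GZIP ... SZ:1000/4000/4.00 ...\n"
          "col_b: BINARY GZIP ... SZ:2500/5000/2.00 ...")
print(sum_compressed(sample))  # 3500
```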
2) One of the largest fields is a Unix timestamp (we may have lots of 
timestamps for a single data record), which is written as a plain int64 (we 
refrained from using OriginalType.TIMESTAMP_MILLIS as it doesn’t seem to be 
supported by Spark yet). The tool says this column is stored with “ENC:PLAIN” 
encoding (which I suppose is GZipped afterwards). Is this the most compact way 
to store timestamps, or would giving an “OriginalType.TIMESTAMP_MILLIS” or 
some other hint make an improvement?
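For context, the field is currently declared along these lines (a schema 
sketch, not our real schema; the message and field names are placeholders):

```
message LogRecord {
  // current: plain int64, which the tool reports as ENC:PLAIN
  repeated int64 event_ts;

  // the variant we're asking about: same physical type, but with the
  // TIMESTAMP_MILLIS annotation
  repeated int64 event_ts_annotated (TIMESTAMP_MILLIS);
}
```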

Thanks,
 Kirill

> On 07 Mar 2016, at 00:26, Ryan Blue <[email protected]> wrote:
> 
> Hi Kirill,
> 
> It's hard to say what the expected compression rate should be since that's
> heavily data-dependent. Sounds like Parquet isn't doing too bad, though.
> 
> For inspecting the files, check out parquet-tools [1]. That can dump the
> metadata from a file all the way down to the page level. The "meta" command
> will print out each row group and column within those row groups, which
> should give you the info you're looking for.
> 
> rb
> 
> 
> [1]:
> http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar
> 
> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <[email protected]>
> wrote:
> 
>> Hi guys,
>> 
>> We’re evaluating Parquet as the high compression format for our logs. We
>> took some ~850 GB of TSV data (some columns are JSON) and Parquet
>> (CompressionCodec.GZIP) gave us 6.8x compression whereas plain GZip (with
>> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the same data.
>> 
>> So the questions are:
>> 
>> 1) is this somewhat expected compression rate (compared to GZip)?
>> 2) As we specially crafted Parquet schema with maps and lists for certain
>> fields, is there any tool to show the sizes of individual Parquet columns
>> so we can find the biggest ones?
>> 
>> Thanks in advance,
>> Kirill
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix
