Re: achieving better compression with Parquet

Kirill Safonov Tue, 15 Mar 2016 15:19:19 -0700

Thanks Ryan,

One more question, please: since we’re going to store timestamped events in 
Parquet, would it be beneficial to write the files chronologically sorted? 
Namely, will a query for a certain time range over a time-sorted Parquet 
file be optimised so that irrelevant portions of the data are skipped and no 
"full scan" is done?
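
To make the question concrete, here is a minimal sketch of how per-row-group min/max statistics could let a reader skip data for a time-range query. It is a pure-Python simulation, not Parquet or Spark code: `RowGroupStats`, `build_row_groups`, and `groups_to_read` are hypothetical names, and whether a real reader skips row groups depends on it applying filter pushdown against those statistics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Min/max of the timestamp column for one row group (hypothetical model)."""
    min_ts: int
    max_ts: int


def build_row_groups(timestamps: List[int], rows_per_group: int) -> List[RowGroupStats]:
    """Split values into fixed-size row groups, in write order, and record min/max."""
    groups = []
    for start in range(0, len(timestamps), rows_per_group):
        chunk = timestamps[start:start + rows_per_group]
        groups.append(RowGroupStats(min(chunk), max(chunk)))
    return groups


def groups_to_read(groups: List[RowGroupStats], lo: int, hi: int) -> List[int]:
    """Indices of row groups whose [min, max] range overlaps the query range [lo, hi]."""
    return [i for i, g in enumerate(groups) if g.max_ts >= lo and g.min_ts <= hi]


# 1000 events written in time order versus the same events written interleaved.
sorted_ts = list(range(1000))
interleaved_ts = [(i * 37) % 1000 for i in range(1000)]  # a permutation of 0..999

sorted_groups = build_row_groups(sorted_ts, 100)
interleaved_groups = build_row_groups(interleaved_ts, 100)

# Query for timestamps 250..349.
print("sorted:", groups_to_read(sorted_groups, 250, 349))
print("interleaved:", groups_to_read(interleaved_groups, 250, 349))
```

With time-sorted input only the row groups covering the requested range overlap the query; with interleaved input every row group spans almost the whole time range, so statistics alone cannot rule any of them out.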


Kirill

> On 14 Mar 2016, at 22:00, Ryan Blue <[email protected]> wrote:
> 
> Adding int64-delta should take weeks. We should also open a bug report for
> that line in Spark. It should not fail if an annotation is unsupported. It
> should ignore it.
> 
> On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <[email protected]>
> wrote:
> 
>> Thanks for the reply, Ryan,
>> 
>>> For 2, PLAIN/gzip is the best option for timestamps right now. The format
>>> 2.0 encodings include a delta-integer encoding that we expect to work
>>> really well for timestamps, but that hasn't been committed for int64 yet.
>> 
>> Is there any ETA on when it can appear? Just the order e.g. weeks or
>> months?
>> 
>>> Also, it should be safe to store timestamps as int64 using the
>>> TIMESTAMP_MILLIS annotation.
>> 
>> Unfortunately this is not the case for us, as Spark fails with
>> "Parquet type not yet supported" [1].
>> 
>> Thanks,
>> Kirill
>> 
>> [1]:
>> 
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
>> 
>> -----Original Message-----
>> From: Ryan Blue [mailto:[email protected]]
>> Sent: Monday, March 14, 2016 7:44 PM
>> To: Parquet Dev
>> Subject: Re: achieving better compression with Parquet
>> 
>> Kirill,
>> 
>> For 1, the reported size is just the data size. That doesn't include page
>> headers, statistics, or dictionary pages. You can see the size of the
>> dictionary pages in the dump output, which I would expect is where the
>> majority of the difference is.
>> 
>> For 2, PLAIN/gzip is the best option for timestamps right now. The format
>> 2.0 encodings include a delta-integer encoding that we expect to work
>> really well for timestamps, but that hasn't been committed for int64 yet.
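
A rough sketch of why a delta-integer encoding should suit timestamps: consecutive values are close together, so the differences need far fewer bits than full 64-bit values. This only illustrates the idea in plain Python. It is not the actual layout of the Parquet 2.0 delta encoding, and the sample data (events at most 2 seconds apart) and the helper names are assumptions.

```python
import random


def delta_encode(values):
    """Keep the first value as-is, then store each difference from the previous value."""
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Assumed sample: 10,000 millisecond timestamps, each 0..2000 ms after the previous one.
rng = random.Random(0)
ts = [1458000000000]
for _ in range(9999):
    ts.append(ts[-1] + rng.randint(0, 2000))

deltas = delta_encode(ts)
assert delta_decode(deltas) == ts  # the transformation is lossless

# Bits needed for the largest delta, versus 64 bits per plain int64 value.
bits_needed = max(deltas[1:]).bit_length()
plain_bytes = len(ts) * 8
packed_estimate = 8 + ((len(ts) - 1) * bits_needed + 7) // 8

print("bits per delta:", bits_needed)
print("plain bytes:", plain_bytes, "bit-packed estimate:", packed_estimate)
```

The estimate ignores block headers and any further compression, so treat it as an order-of-magnitude comparison only.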
>> 
>> Also, it should be safe to store timestamps as int64 using the
>> TIMESTAMP_MILLIS annotation. That's just a way to keep track of what the
>> values you write represent. When there isn't specific support for it, you
>> should just get an int64. Using that annotation should give you the exact
>> same behavior as not using it right now, but when you update to a version
>> of Spark that supports it you should be able to get timestamps out of your
>> existing data.
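
A minimal sketch of the behaviour described above, assuming the annotation is only metadata layered on a stored int64 of milliseconds since the Unix epoch. `read_value` and its `reader_supports_annotation` flag are hypothetical and do not correspond to any Parquet or Spark API.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def read_value(raw_int64, annotation, reader_supports_annotation):
    """Return a datetime if the reader understands the annotation, else the raw int64."""
    if annotation == "TIMESTAMP_MILLIS" and reader_supports_annotation:
        return EPOCH + timedelta(milliseconds=raw_int64)
    # Unknown or unsupported annotation: fall back to the physical type.
    return raw_int64


stored = 1458000000000  # milliseconds since the epoch, written as int64

# A reader without support sees the same int64 it would see with no annotation.
print(read_value(stored, "TIMESTAMP_MILLIS", reader_supports_annotation=False))

# A reader with support interprets the same bytes as a timestamp.
print(read_value(stored, "TIMESTAMP_MILLIS", reader_supports_annotation=True))
```

The stored bytes are identical in both cases; only the interpretation on read changes, which is why existing data would become readable as timestamps after an upgrade.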
>> 
>> rb
>> 
>> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <[email protected]>
>> wrote:
>> 
>>> Thanks for the hint Ryan!
>>> 
>>> I applied the tool to the file and I’ve got some more questions if you
>>> don’t mind :-)
>>> 
>>> 1) We’re using a 64 MB page (row group) size, so I would expect the sum of
>>> all the values in the “compressed size” field (which is {x} in
>>> SZ:{x}/{y}/{z} notation) to be around 64 MB, but it’s near 48 MB. Is this
>>> expected?
>>> 2) One of the largest fields is a Unix timestamp (we may have lots of
>>> timestamps for a single data record), which is written as plain int64
>>> (we refrained from using OriginalType.TIMESTAMP_MILLIS as it seems
>>> not to be supported by Spark yet). The tool says that this column is
>>> stored with “ENC:PLAIN” encoding (which I suppose is GZipped
>>> afterwards). Is this the most compact way to store timestamps, or would
>>> e.g. giving an “OriginalType.TIMESTAMP_MILLIS” or other hint make an
>>> improvement?
>>> 
>>> Thanks,
>>> Kirill
>>> 
>>>> On 07 Mar 2016, at 00:26, Ryan Blue <[email protected]> wrote:
>>>> 
>>>> Hi Kirill,
>>>> 
>>>> It's hard to say what the expected compression rate should be since
>>>> that's heavily data-dependent. Sounds like Parquet isn't doing too bad,
>>>> though.
>>>> 
>>>> For inspecting the files, check out parquet-tools [1]. That can dump
>>>> the metadata from a file all the way down to the page level. The "meta"
>>>> command will print out each row group and column within those row groups,
>>>> which should give you the info you're looking for.
>>>> 
>>>> rb
>>>> 
>>>> 
>>>> [1]:
>>>> 
>>>> http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar
>>>> 
>>>> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <[email protected]>
>>>> wrote:
>>>> 
>>>>> Hi guys,
>>>>> 
>>>>> We’re evaluating Parquet as a high-compression format for our
>>>>> logs. We took some ~850 GB of TSV data (some columns are JSON) and
>>>>> Parquet (CompressionCodec.GZIP) gave us 6.8x compression, whereas plain
>>>>> GZip (with Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the
>>>>> same data.
>>>>> 
>>>>> So the questions are:
>>>>> 
>>>>> 1) Is this a somewhat expected compression rate (compared to GZip)?
>>>>> 2) As we specially crafted the Parquet schema with maps and lists for
>>>>> certain fields, is there any tool to show the sizes of individual
>>>>> Parquet columns so we can find the biggest ones?
>>>>> 
>>>>> Thanks in advance,
>>>>> Kirill
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>> 
>>> 
>> 
>> 
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>> 
>> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix
