Antwnis,

A typical query for us is something like ‘Select events where [attribute 
constraints go here] and timestamp > 2016-03-16 and timestamp < 2016-03-17’, 
which is why I’m asking whether such a query can benefit from timestamp ordering.
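
To make the pattern concrete, here is a rough sketch of the kind of read we’d
hope could skip whole row groups whose min/max statistics fall outside the
range, provided the file is written in timestamp order. It assumes the
parquet-mr 1.8 filter2 API; the column name "timestamp", the Avro-based reader,
and the class/path names are just placeholders for our setup, and the code is
untested:

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.filter2.predicate.Operators.LongColumn;
import org.apache.parquet.hadoop.ParquetReader;

import java.time.Instant;

public class TimeRangeScan {
  public static void main(String[] args) throws Exception {
    long from = Instant.parse("2016-03-16T00:00:00Z").toEpochMilli();
    long to = Instant.parse("2016-03-17T00:00:00Z").toEpochMilli();

    // Range predicate on the int64 timestamp column. With a time-sorted file,
    // row groups whose min/max statistics fall entirely outside (from, to)
    // can be dropped without being read.
    LongColumn ts = FilterApi.longColumn("timestamp");
    FilterPredicate inRange =
        FilterApi.and(FilterApi.gt(ts, from), FilterApi.lt(ts, to));

    try (ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(new Path(args[0]))
        .withFilter(FilterCompat.get(inRange))
        .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        // ... apply the remaining attribute constraints here ...
      }
    }
  }
}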

> On 16 Mar 2016, at 03:03, Antwnis <[email protected]> wrote:
> 
> Kirill,
> 
> I would think that if such a capability is introduced, it should be
> optional, as depending on your query patterns it might make more sense to
> sort on another column.
> 
> On Tue, Mar 15, 2016 at 10:18 PM, Kirill Safonov <[email protected]>
> wrote:
> 
>> Thanks Ryan,
>> 
>> One more question, please: as we’re going to store timestamped events in
>> Parquet, would it be beneficial to write the files chronologically sorted?
>> Namely, will a query for a certain time range over a time-sorted Parquet
>> file be optimised so that the irrelevant portion of the data is skipped and
>> no "full scan" is done?
>> 
>> Kirill
>> 
>>> On 14 Mar 2016, at 22:00, Ryan Blue <[email protected]> wrote:
>>> 
>>> Adding int64-delta should be a matter of weeks. We should also open a bug
>>> report for that line in Spark: it should not fail if an annotation is
>>> unsupported; it should ignore it.
>>> 
>>> On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <[email protected]>
>>> wrote:
>>> 
>>>> Thanks for the reply, Ryan,
>>>> 
>>>>> For 2, PLAIN/gzip is the best option for timestamps right now. The format
>>>>> 2.0 encodings include a delta-integer encoding that we expect to work
>>>>> really well for timestamps, but that hasn't been committed for int64 yet.
>>>> 
>>>> Is there any ETA on when it might appear? Just the rough order, e.g. weeks
>>>> or months?
>>>> 
>>>>> Also, it should be safe to store timestamps as int64 using the
>>>>> TIMESTAMP_MILLIS annotation.
>>>> 
>>>> Unfortunately this is not the case for us, as Spark’s Parquet support
>>>> complains with "Parquet type not yet supported" [1].
>>>> 
>>>> Thanks,
>>>> Kirill
>>>> 
>>>> [1]:
>>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
>>>> 
>>>> -----Original Message-----
>>>> From: Ryan Blue [mailto:[email protected]]
>>>> Sent: Monday, March 14, 2016 7:44 PM
>>>> To: Parquet Dev
>>>> Subject: Re: achieving better compression with Parquet
>>>> 
>>>> Kirill,
>>>> 
>>>> For 1, the reported size is just the data size. That doesn't include page
>>>> headers, statistics, or dictionary pages. You can see the size of the
>>>> dictionary pages in the dump output, which I would expect is where the
>>>> majority of the difference is.
>>>> 
>>>> For 2, PLAIN/gzip is the best option for timestamps right now. The format
>>>> 2.0 encodings include a delta-integer encoding that we expect to work
>>>> really well for timestamps, but that hasn't been committed for int64 yet.
>>>> 
>>>> Also, it should be safe to store timestamps as int64 using the
>>>> TIMESTAMP_MILLIS annotation. That's just a way to keep track of what the
>>>> values you write represent. When there isn't specific support for it, you
>>>> should just get an int64. Using that annotation should give you the exact
>>>> same behavior as not using it right now, but when you update to a version
>>>> of Spark that supports it you should be able to get timestamps out of your
>>>> existing data.
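>>>> 
>>>> For example, a schema declaring the annotation could look roughly like this
>>>> (a sketch using the parquet-mr Types builder; the field name "event_time",
>>>> the message name "event", and the class name are placeholders):
>>>> 
>>>> import org.apache.parquet.schema.MessageType;
>>>> import org.apache.parquet.schema.OriginalType;
>>>> import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
>>>> import org.apache.parquet.schema.Types;
>>>> 
>>>> public class EventSchema {
>>>>   // int64 annotated as TIMESTAMP_MILLIS: readers that don't know the
>>>>   // annotation still see a plain int64 column.
>>>>   public static final MessageType SCHEMA = Types.buildMessage()
>>>>       .required(PrimitiveTypeName.INT64)
>>>>           .as(OriginalType.TIMESTAMP_MILLIS)
>>>>           .named("event_time")
>>>>       .named("event");
>>>> }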
>>>> 
>>>> rb
>>>> 
>>>> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <[email protected]>
>>>> wrote:
>>>> 
>>>>> Thanks for the hint Ryan!
>>>>> 
>>>>> I applied the tool to the file and I’ve got some more questions if you
>>>>> don’t mind :-)
>>>>> 
>>>>> 1) We’re using a 64 MB page (row group) size, so I would expect the sum of
>>>>> all the values in the "compressed size" field (which is {x} in the
>>>>> SZ:{x}/{y}/{z} notation) to be around 64 MB, but it’s near 48 MB. Is this
>>>>> expected?
>>>>> 2) One of the largest fields is a Unix timestamp (we may have lots of
>>>>> timestamps for a single data record), which is written as plain int64
>>>>> (we refrained from using OriginalType.TIMESTAMP_MILLIS as it doesn’t seem
>>>>> to be supported by Spark yet). The tool says that this column is
>>>>> stored with "ENC:PLAIN" encoding (which I suppose is GZipped
>>>>> afterwards). Is this the most compact way to store timestamps, or will
>>>>> giving an "OriginalType.TIMESTAMP_MILLIS" or other hint make an
>>>>> improvement?
>>>>> 
>>>>> Thanks,
>>>>> Kirill
>>>>> 
>>>>>> On 07 Mar 2016, at 00:26, Ryan Blue <[email protected]> wrote:
>>>>>> 
>>>>>> Hi Kirill,
>>>>>> 
>>>>>> It's hard to say what the expected compression rate should be, since
>>>>>> that's heavily data-dependent. Sounds like Parquet isn't doing too badly,
>>>>>> though.
>>>>>> 
>>>>>> For inspecting the files, check out parquet-tools [1]. That can dump the
>>>>>> metadata from a file all the way down to the page level. The "meta" command
>>>>>> will print out each row group and column within those row groups, which
>>>>>> should give you the info you're looking for.
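>>>>>> 
>>>>>> With the 1.8.1 jar from the link below, the invocation is typically along
>>>>>> these lines (the file path is a placeholder; the exact command may differ
>>>>>> depending on how you obtain the jar):
>>>>>> 
>>>>>>   java -jar parquet-tools-1.8.1.jar meta /path/to/file.parquet
>>>>>>   java -jar parquet-tools-1.8.1.jar dump /path/to/file.parquet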
>>>>>> 
>>>>>> rb
>>>>>> 
>>>>>> 
>>>>>> [1]:
>>>>>> http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar
>>>>>> 
>>>>>> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi guys,
>>>>>>> 
>>>>>>> We’re evaluating Parquet as a high-compression format for our logs. We
>>>>>>> took ~850 GB of TSV data (some columns are JSON), and Parquet
>>>>>>> (CompressionCodec.GZIP) gave us 6.8x compression, whereas plain GZip
>>>>>>> (with Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the
>>>>>>> same data.
>>>>>>> 
>>>>>>> So the questions are:
>>>>>>> 
>>>>>>> 1) Is this a somewhat expected compression rate (compared to GZip)?
>>>>>>> 2) As we specially crafted a Parquet schema with maps and lists for
>>>>>>> certain fields, is there any tool to show the sizes of individual
>>>>>>> Parquet columns so we can find the biggest ones?
>>>>>>> 
>>>>>>> Thanks in advance,
>>>>>>> Kirill
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>> 
>> 
