Kirill,

Yes, sorting data by the columns you intend to filter by will definitely
help query performance: we keep min/max stats for each column chunk and
page, and those stats are used to eliminate row groups when you pass
filters into Parquet.
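In case a concrete illustration helps, here is a minimal sketch of the
write-sorted / filter-on-read pattern. It uses the Spark 2.x DataFrame API
for brevity (newer than the Spark version discussed in this thread), and
the column names, paths and timestamp values are made up:

    import org.apache.spark.sql.SparkSession

    object SortedWriteSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("sorted-write-sketch")
          .getOrCreate()
        import spark.implicits._

        // Hypothetical source: events with an epoch-millis "timestamp" column.
        val events = spark.read.json("hdfs:///logs/events/2016-03-16/*.json")

        // Sort on the column you will filter by, so each row group's min/max
        // statistics cover a narrow range of values.
        events.sortWithinPartitions($"timestamp")
          .write
          .option("compression", "gzip")
          .parquet("hdfs:///warehouse/events_parquet")

        // The range predicate is pushed down to Parquet; row groups whose
        // [min, max] interval does not overlap 2016-03-16..2016-03-17
        // are skipped entirely.
        val count = spark.read.parquet("hdfs:///warehouse/events_parquet")
          .filter($"timestamp" >= 1458086400000L && $"timestamp" < 1458172800000L)
          .count()
        println(count)
      }
    }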
rb

On Wed, Mar 16, 2016 at 1:07 AM, Kirill Safonov <[email protected]> wrote:

> Antwins,
>
> A typical query for us is something like ‘Select events where [here come
> attribute constraints] and timestamp > 2016-03-16 and timestamp <
> 2016-03-17’, which is why I’m asking whether this query can benefit from
> timestamp ordering.
>
> > On 16 Mar 2016, at 03:03, Antwnis <[email protected]> wrote:
> >
> > Kirill,
> >
> > I would think that if such a capability is introduced it should be
> > `optional`, as depending on your query patterns it might make more sense
> > to sort on another column.
> >
> > On Tue, Mar 15, 2016 at 10:18 PM, Kirill Safonov <[email protected]> wrote:
> >
> >> Thanks Ryan,
> >>
> >> One more question, please: as we’re going to store timestamped events in
> >> Parquet, would it be beneficial to write the files chronologically sorted?
> >> Namely, will a query for a certain time range over a time-sorted Parquet
> >> file be optimised so that the irrelevant portion of the data is skipped
> >> and no "full scan" is done?
> >>
> >> Kirill
> >>
> >>> On 14 Mar 2016, at 22:00, Ryan Blue <[email protected]> wrote:
> >>>
> >>> Adding int64-delta should be weeks. We should also open a bug report for
> >>> that line in Spark. It should not fail if an annotation is unsupported;
> >>> it should ignore it.
> >>>
> >>> On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <[email protected]> wrote:
> >>>
> >>>> Thanks for the reply, Ryan.
> >>>>
> >>>>> For 2, PLAIN/gzip is the best option for timestamps right now. The
> >>>>> format 2.0 encodings include a delta-integer encoding that we expect
> >>>>> to work really well for timestamps, but that hasn't been committed
> >>>>> for int64 yet.
> >>>>
> >>>> Is there any ETA on when it might appear? Just the order of magnitude,
> >>>> e.g. weeks or months?
> >>>>
> >>>>> Also, it should be safe to store timestamps as int64 using the
> >>>>> TIMESTAMP_MILLIS annotation.
> >>>>
> >>>> Unfortunately this is not the case for us, as Parquet complains with
> >>>> "Parquet type not yet supported" [1].
> >>>>
> >>>> Thanks,
> >>>> Kirill
> >>>>
> >>>> [1]:
> >>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
> >>>>
> >>>> -----Original Message-----
> >>>> From: Ryan Blue [mailto:[email protected]]
> >>>> Sent: Monday, March 14, 2016 7:44 PM
> >>>> To: Parquet Dev
> >>>> Subject: Re: achieving better compression with Parquet
> >>>>
> >>>> Kirill,
> >>>>
> >>>> For 1, the reported size is just the data size. That doesn't include
> >>>> page headers, statistics, or dictionary pages. You can see the size of
> >>>> the dictionary pages in the dump output, which I would expect is where
> >>>> the majority of the difference is.
> >>>>
> >>>> For 2, PLAIN/gzip is the best option for timestamps right now. The
> >>>> format 2.0 encodings include a delta-integer encoding that we expect to
> >>>> work really well for timestamps, but that hasn't been committed for
> >>>> int64 yet.
> >>>>
> >>>> Also, it should be safe to store timestamps as int64 using the
> >>>> TIMESTAMP_MILLIS annotation. That's just a way to keep track of what
> >>>> the values you write represent. When there isn't specific support for
> >>>> it, you should just get an int64. Using that annotation should give you
> >>>> the exact same behavior as not using it right now, but when you update
> >>>> to a version of Spark that supports it you should be able to get
> >>>> timestamps out of your existing data.
> >>>>
> >>>> rb
> >>>>
> >>>> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <[email protected]> wrote:
> >>>>
> >>>>> Thanks for the hint, Ryan!
> >>>>>
> >>>>> I applied the tool to the file and I’ve got some more questions, if
> >>>>> you don’t mind :-)
> >>>>>
> >>>>> 1) We’re using a 64 MB page (row group) size, so I would expect the
> >>>>> sum of all the values in the “compressed size” field (which is {x} in
> >>>>> the SZ:{x}/{y}/{z} notation) to be around 64 MB, but it’s near 48 MB.
> >>>>> Is this expected?
> >>>>> 2) One of the largest fields is a Unix timestamp (we may have lots of
> >>>>> timestamps for a single data record), which is written as a plain
> >>>>> int64 (we refrained from using OriginalType.TIMESTAMP_MILLIS as it
> >>>>> seems to be not yet supported by Spark). The tool says that this
> >>>>> column is stored with “ENC:PLAIN” encoding (which I suppose is
> >>>>> gzipped afterwards). Is this the most compact way to store
> >>>>> timestamps, or would e.g. adding an OriginalType.TIMESTAMP_MILLIS or
> >>>>> other hint make an improvement?
> >>>>>
> >>>>> Thanks,
> >>>>> Kirill
> >>>>>
> >>>>>> On 07 Mar 2016, at 00:26, Ryan Blue <[email protected]> wrote:
> >>>>>>
> >>>>>> Hi Kirill,
> >>>>>>
> >>>>>> It's hard to say what the expected compression rate should be since
> >>>>>> that's heavily data-dependent. Sounds like Parquet isn't doing too
> >>>>>> bad, though.
> >>>>>>
> >>>>>> For inspecting the files, check out parquet-tools [1]. That can dump
> >>>>>> the metadata from a file all the way down to the page level. The
> >>>>>> "meta" command will print out each row group and the columns within
> >>>>>> those row groups, which should give you the info you're looking for.
> >>>>>>
> >>>>>> rb
> >>>>>>
> >>>>>> [1]:
> >>>>>> http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar
> >>>>>>
> >>>>>> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi guys,
> >>>>>>>
> >>>>>>> We’re evaluating Parquet as a high-compression format for our logs.
> >>>>>>> We took some ~850 GB of TSV data (some columns are JSON), and
> >>>>>>> Parquet (CompressionCodec.GZIP) gave us 6.8x compression, whereas
> >>>>>>> plain gzip (with Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times
> >>>>>>> less) on the same data.
> >>>>>>>
> >>>>>>> So the questions are:
> >>>>>>>
> >>>>>>> 1) Is this a somewhat expected compression rate (compared to gzip)?
> >>>>>>> 2) As we specially crafted the Parquet schema with maps and lists
> >>>>>>> for certain fields, is there any tool to show the sizes of
> >>>>>>> individual Parquet columns so we can find the biggest ones?
> >>>>>>>
> >>>>>>> Thanks in advance,
> >>>>>>> Kirill
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Ryan Blue
> >>>>>> Software Engineer
> >>>>>> Netflix
> >>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>> Software Engineer
> >>>> Netflix
> >>>>
> >>>
> >>>
> >>> --
> >>> Ryan Blue
> >>> Software Engineer
> >>> Netflix
> >>
> >>
>
--
Ryan Blue
Software Engineer
Netflix
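For reference, a minimal parquet-mr sketch (1.8.x-era API) of the schema
declaration discussed above: an int64 timestamp column carrying the
TIMESTAMP_MILLIS annotation. The message and field names are made up, and
the ParquetWriter wiring is left out; as noted earlier in the thread, the
Spark version discussed here rejected the annotation, so check your reader
before adopting it:

    import org.apache.parquet.schema.{MessageType, OriginalType, Types}
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.{BINARY, INT64}

    object TimestampSchemaSketch {
      // Equivalent schema string:
      //   message event {
      //     required int64 ts (TIMESTAMP_MILLIS);
      //     optional binary payload (UTF8);
      //   }
      val eventSchema: MessageType = Types.buildMessage()
        .required(INT64).as(OriginalType.TIMESTAMP_MILLIS).named("ts")
        .optional(BINARY).as(OriginalType.UTF8).named("payload")
        .named("event")

      def main(args: Array[String]): Unit = {
        // Prints the schema back in the message-type string syntax above.
        println(eventSchema)
      }
    }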
