Re: achieving better compression with Parquet

Ryan Blue Sun, 08 May 2016 15:22:30 -0700

Right now, the spec supports columns in separate files, but I don't think
that the implementation does. It wouldn't be too hard to make that work,
but I don't think it does today.


For predicate push-down in Spark, I've gotten it working and will be
getting the patches into upstream Spark. It mostly works now with a few
settings, but has disabled string/binary stats filtering because of
PARQUET-251. I am also trying to get a few important patches in to help
when writing Parquet files and avoid OOMs.

rb

On Sat, May 7, 2016 at 2:39 PM, Kirill Safonov <[email protected]>
wrote:

> Hi Ryan, guys,
>
> Let me please follow up on your last answer. Parquet file can be physically
> stored as a single file (written via WriteSupport) or as a folder with a
> collection of "parallel" files (generated by map-reduce or Spark
> via ParquetOutputFormat).
>
> Will a Spark task processing Parquet input benefit equally from min/max
> stats for both cases (single file vs folder)?
>
> Thanks,
>  Kirill
>
> On Wed, Mar 16, 2016 at 8:30 PM, Ryan Blue <[email protected]>
> wrote:
>
> > Kirill,
> >
> > Yes, sorting data by the columns you intend to filter by will definitely
> > help query performance because we keep min/max stats for each column
> chunk
> > and page that are used to eliminate row groups when you are passing
> filters
> > into Parquet.
> >
> > rb
> >
> > On Wed, Mar 16, 2016 at 1:07 AM, Kirill Safonov <
> [email protected]>
> > wrote:
> >
> > > Antwins,
> > >
> > > Typical query for us is something like ‘Select events where [here come
> > > attributes constraints] and timestamp > 2016-03-16 and timestamp <
> > > 2016-03-17’, that’s why I’m asking if this query can benefit from
> > timestamp
> > > ordering.
> > >
> > > > On 16 Mar 2016, at 03:03, Antwnis <[email protected]> wrote:
> > > >
> > > > Kirill,
> > > >
> > > > I would think that if such a capability is introduced it should be
> > > > `optional` as depending on your query patterns it might make more
> sense
> > > to
> > > > sort on another column.
> > > >
> > > > On Tue, Mar 15, 2016 at 10:18 PM, Kirill Safonov <
> > > [email protected]>
> > > > wrote:
> > > >
> > > >> Thanks Ryan,
> > > >>
> > > >> One more question please: as we’re going to store timestamped events
> > in
> > > >> Parquet, would it be beneficial to write the files chronologically
> > > sorted?
> > > >> Namely, will the query for the certain time range over the
> time-sorted
> > > >> Parquet file be optimised so that irrelevant portion of data is
> > skipped
> > > and
> > > >> no "full scan" is done?
> > > >>
> > > >> Kirill
> > > >>
> > > >>> On 14 Mar 2016, at 22:00, Ryan Blue <[email protected]>
> > wrote:
> > > >>>
> > > >>> Adding int64-delta should be weeks. We should also open a bug
> report
> > > for
> > > >>> that line in Spark. It should not fail if an annotation is
> > unsupported.
> > > >> It
> > > >>> should ignore it.
> > > >>>
> > > >>> On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <
> > > >> [email protected]>
> > > >>> wrote:
> > > >>>
> > > >>>> Thanks for reply Ryan,
> > > >>>>
> > > >>>>> For 2, PLAIN/gzip is the best option for timestamps right now.
> The
> > > >> format
> > > >>>>> 2.0 encodings include a delta-integer encoding that we expect to
> > work
> > > >>>> really well for timestamps, but that hasn't been committed for
> int64
> > > >> yet.
> > > >>>>
> > > >>>> Is there any ETA on when it can appear? Just the order e.g. weeks
> or
> > > >>>> months?
> > > >>>>
> > > >>>>> Also, it should be safe to store timestamps as int64 using the
> > > >>>> TIMESTAMP_MILLIS annotation.
> > > >>>>
> > > >>>> Unfortunately this is not the case for us as the Parquet complains
> > > with
> > > >>>> "Parquet type not yet supported" [1].
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Kirill
> > > >>>>
> > > >>>> [1]:
> > > >>>>
> > > >>>>
> > > >>
> > >
> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
> > > >>>>
> > > >>>> -----Original Message-----
> > > >>>> From: Ryan Blue [mailto:[email protected]]
> > > >>>> Sent: Monday, March 14, 2016 7:44 PM
> > > >>>> To: Parquet Dev
> > > >>>> Subject: Re: achieving better compression with Parquet
> > > >>>>
> > > >>>> Kirill,
> > > >>>>
> > > >>>> For 1, the reported size is just the data size. That doesn't
> include
> > > >> page
> > > >>>> headers, statistics, or dictionary pages. You can see the size of
> > the
> > > >>>> dictionary pages in the dump output, which I would expect is where
> > the
> > > >>>> majority of the difference is.
> > > >>>>
> > > >>>> For 2, PLAIN/gzip is the best option for timestamps right now. The
> > > >> format
> > > >>>> 2.0 encodings include a delta-integer encoding that we expect to
> > work
> > > >>>> really well for timestamps, but that hasn't been committed for
> int64
> > > >> yet.
> > > >>>>
> > > >>>> Also, it should be safe to store timestamps as int64 using the
> > > >>>> TIMESTAMP_MILLIS annotation. That's just a way to keep track of
> what
> > > the
> > > >>>> values you write represent. When there isn't specific support for
> > it,
> > > >> you
> > > >>>> should just get an int64. Using that annotation should give you
> the
> > > >> exact
> > > >>>> same behavior as not using it right now, but when you update to a
> > > >> version
> > > >>>> of Spark that supports it you should be able to get timestamps out
> > of
> > > >> your
> > > >>>> existing data.
> > > >>>>
> > > >>>> rb
> > > >>>>
> > > >>>> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <
> > > >> [email protected]>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Thanks for the hint Ryan!
> > > >>>>>
> > > >>>>> I applied the tool to the file and I’ve got some more questions
> if
> > > you
> > > >>>>> don’t mind :-)
> > > >>>>>
> > > >>>>> 1) We’re using 64Mb page (row group) size so I would expect the
> sum
> > > of
> > > >>>>> all the values in “compressed size” field (which is {x} in
> > > >>>>> SZ:{x}/{y}/{z}
> > > >>>>> notation) to be around 64 Mb, but it’s near 48 Mb. Is this
> > expected?
> > > >>>>> 2) One of the largest field is Unix timestamp (we may have lots
> of
> > > >>>>> timestamps for a single data record) which is written as plain
> > int64
> > > >>>>> (we refrained from using OriginalType.TIMESTAMP_MILLIS as it
> seems
> > to
> > > >>>>> be not yet supported by Spark). The tool says that this column is
> > > >>>>> stored with “ENC:PLAIN” encoding (which I suppose is GZipped
> > > >>>>> afterwards). Is this the most compact way to store timestamps or
> > e.g.
> > > >>>>> giving a "OriginalType.TIMESTAMP_MILLIS” or other hint will make
> an
> > > >>>> improvement?
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>> Kirill
> > > >>>>>
> > > >>>>>> On 07 Mar 2016, at 00:26, Ryan Blue <[email protected]>
> > > >> wrote:
> > > >>>>>>
> > > >>>>>> Hi Kirill,
> > > >>>>>>
> > > >>>>>> It's hard to say what the expected compression rate should be
> > since
> > > >>>>> that's
> > > >>>>>> heavily data-dependent. Sounds like Parquet isn't doing too bad,
> > > >>>> though.
> > > >>>>>>
> > > >>>>>> For inspecting the files, check out parquet-tools [1]. That can
> > dump
> > > >>>>>> the metadata from a file all the way down to the page level. The
> > > >> "meta"
> > > >>>>> command
> > > >>>>>> will print out each row group and column within those row
> groups,
> > > >>>>>> which should give you the info you're looking for.
> > > >>>>>>
> > > >>>>>> rb
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> [1]:
> > > >>>>>>
> > > >>>>>
> > > http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparque
> > > >>>>> t-tools%7C1.8.1%7Cjar
> > > >>>>>>
> > > >>>>>> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov
> > > >>>>>> <[email protected]
> > > >>>>>>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> Hi guys,
> > > >>>>>>>
> > > >>>>>>> We’re evaluating Parquet as the high compression format for our
> > > >>>>>>> logs. We took some ~850Gb of TSV data (some columns are JSON)
> and
> > > >>>>>>> Parquet
> > > >>>>>>> (CompressionCodec.GZIP) gave us 6.8x compression whereas plain
> > GZip
> > > >>>>> (with
> > > >>>>>>> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the
> > same
> > > >>>> data.
> > > >>>>>>>
> > > >>>>>>> So the questions are:
> > > >>>>>>>
> > > >>>>>>> 1) is this somewhat expected compression rate (compared to
> GZip)?
> > > >>>>>>> 2) As we specially crafted Parquet schema with maps and lists
> for
> > > >>>>> certain
> > > >>>>>>> fields, is there any tool to show the sizes of individual
> Parquet
> > > >>>>> columns
> > > >>>>>>> so we can find the biggest ones?
> > > >>>>>>>
> > > >>>>>>> Thanks in advance,
> > > >>>>>>> Kirill
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> --
> > > >>>>>> Ryan Blue
> > > >>>>>> Software Engineer
> > > >>>>>> Netflix
> > > >>>>>
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>>> --
> > > >>>> Ryan Blue
> > > >>>> Software Engineer
> > > >>>> Netflix
> > > >>>>
> > > >>>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Ryan Blue
> > > >>> Software Engineer
> > > >>> Netflix
> > > >>
> > > >>
> > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>
>
>
> --
>  kir
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: achieving better compression with Parquet

Reply via email to