Adding the delta encoding for int64 should be a matter of weeks rather than months. We should also open a bug report for that line in Spark: it should not fail when an annotation is unsupported, it should just ignore it.
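
For reference, the annotation is just part of the schema declaration, so using it changes nothing about how the values are stored today. A minimal sketch with parquet-mr (the message and field names here are only for illustration):

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class TimestampSchemaExample {
      public static void main(String[] args) {
        // Hypothetical schema: "event_time" is stored as a plain int64;
        // the TIMESTAMP_MILLIS annotation only records what the values mean,
        // so readers without specific support still see an int64 column.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message event {\n"
          + "  required int64 event_time (TIMESTAMP_MILLIS);\n"
          + "  required binary payload (UTF8);\n"
          + "}");
        System.out.println(schema);
      }
    }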
On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <[email protected]> wrote:

> Thanks for the reply Ryan,
>
> > For 2, PLAIN/gzip is the best option for timestamps right now. The format
> > 2.0 encodings include a delta-integer encoding that we expect to work
> > really well for timestamps, but that hasn't been committed for int64 yet.
>
> Is there any ETA on when it can appear? Just the order, e.g. weeks or
> months?
>
> > Also, it should be safe to store timestamps as int64 using the
> > TIMESTAMP_MILLIS annotation.
>
> Unfortunately this is not the case for us, as Parquet complains with
> "Parquet type not yet supported" [1].
>
> Thanks,
> Kirill
>
> [1]:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
>
> -----Original Message-----
> From: Ryan Blue [mailto:[email protected]]
> Sent: Monday, March 14, 2016 7:44 PM
> To: Parquet Dev
> Subject: Re: achieving better compression with Parquet
>
> Kirill,
>
> For 1, the reported size is just the data size. That doesn't include page
> headers, statistics, or dictionary pages. You can see the size of the
> dictionary pages in the dump output, which is where I would expect most of
> the difference to be.
>
> For 2, PLAIN/gzip is the best option for timestamps right now. The format
> 2.0 encodings include a delta-integer encoding that we expect to work
> really well for timestamps, but that hasn't been committed for int64 yet.
>
> Also, it should be safe to store timestamps as int64 using the
> TIMESTAMP_MILLIS annotation. That's just a way to keep track of what the
> values you write represent. When there isn't specific support for it, you
> should just get an int64. Using that annotation should give you the exact
> same behavior as not using it right now, but when you update to a version
> of Spark that supports it, you should be able to get timestamps out of
> your existing data.
>
> rb
>
> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <[email protected]>
> wrote:
>
> > Thanks for the hint Ryan!
> >
> > I applied the tool to the file and I've got some more questions, if you
> > don't mind :-)
> >
> > 1) We're using a 64 MB page (row group) size, so I would expect the sum
> > of all the values in the "compressed size" field (which is {x} in the
> > SZ:{x}/{y}/{z} notation) to be around 64 MB, but it's closer to 48 MB.
> > Is this expected?
> > 2) One of the largest fields is a Unix timestamp (we may have lots of
> > timestamps for a single data record), which is written as plain int64
> > (we refrained from using OriginalType.TIMESTAMP_MILLIS as it seems not
> > to be supported by Spark yet). The tool says that this column is stored
> > with "ENC:PLAIN" encoding (which I suppose is gzipped afterwards). Is
> > this the most compact way to store timestamps, or will e.g. giving an
> > "OriginalType.TIMESTAMP_MILLIS" or other hint make an improvement?
> >
> > Thanks,
> > Kirill
> >
> > > On 07 Mar 2016, at 00:26, Ryan Blue <[email protected]> wrote:
> > >
> > > Hi Kirill,
> > >
> > > It's hard to say what the expected compression rate should be, since
> > > that's heavily data-dependent. Sounds like Parquet isn't doing too
> > > badly, though.
> > >
> > > For inspecting the files, check out parquet-tools [1]. That can dump
> > > the metadata from a file all the way down to the page level. The "meta"
> > > command will print out each row group and column within those row
> > > groups, which should give you the info you're looking for.
> > >
> > > rb
> > >
> > > [1]:
> > > http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar
> > >
> > > On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <[email protected]>
> > > wrote:
> > >
> > >> Hi guys,
> > >>
> > >> We're evaluating Parquet as the high-compression format for our logs.
> > >> We took ~850 GB of TSV data (some columns are JSON), and Parquet
> > >> (CompressionCodec.GZIP) gave us 6.8x compression, whereas plain gzip
> > >> (with Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the
> > >> same data.
> > >>
> > >> So the questions are:
> > >>
> > >> 1) Is this a somewhat expected compression rate (compared to gzip)?
> > >> 2) As we specially crafted the Parquet schema with maps and lists for
> > >> certain fields, is there any tool to show the sizes of individual
> > >> Parquet columns so we can find the biggest ones?
> > >>
> > >> Thanks in advance,
> > >> Kirill
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
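
As a rough sketch of the per-column breakdown discussed above (assuming parquet-mr 1.8.x; the input path is a placeholder), the same row-group and column information that the parquet-tools "meta" command prints can also be read programmatically from the file footer:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    public class ColumnSizes {
      public static void main(String[] args) throws Exception {
        // Placeholder path: point this at one of your gzip-compressed Parquet files.
        Path file = new Path(args[0]);
        ParquetMetadata footer = ParquetFileReader.readFooter(new Configuration(), file);

        for (BlockMetaData rowGroup : footer.getBlocks()) {
          System.out.println("row group, uncompressed bytes: " + rowGroup.getTotalByteSize());
          for (ColumnChunkMetaData column : rowGroup.getColumns()) {
            // getTotalSize() is the compressed on-disk size of the column chunk;
            // getTotalUncompressedSize() is the size before the codec is applied.
            System.out.println("  " + column.getPath()
                + " codec=" + column.getCodec()
                + " encodings=" + column.getEncodings()
                + " compressed=" + column.getTotalSize()
                + " uncompressed=" + column.getTotalUncompressedSize());
          }
        }
      }
    }

Sorting the column chunks by getTotalSize() across row groups is one way to surface the biggest columns quickly.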
