Thanks for the reply, Ryan.

> For 2, PLAIN/gzip is the best option for timestamps right now. The format
> 2.0 encodings include a delta-integer encoding that we expect to work really
> well for timestamps, but that hasn't been committed for int64 yet.

Is there any ETA on when it might appear? Just a rough order of magnitude,
e.g. weeks or months?

> Also, it should be safe to store timestamps as int64 using the
> TIMESTAMP_MILLIS annotation.

Unfortunately that is not the case for us, as Spark complains with "Parquet
type not yet supported" [1].
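For reference, the declaration we have in mind looks roughly like this with
the parquet-mr schema builder (the field and message names below are just
placeholders):

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.OriginalType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Types;

    // Equivalent to: message log_record { required int64 event_ts (TIMESTAMP_MILLIS); }
    // The TIMESTAMP_MILLIS annotation is what the converter linked in [1]
    // currently rejects when Spark reads the file back.
    MessageType schema = Types.buildMessage()
        .required(PrimitiveTypeName.INT64)
          .as(OriginalType.TIMESTAMP_MILLIS)
          .named("event_ts")
        .named("log_record");

Dropping the .as(OriginalType.TIMESTAMP_MILLIS) call is what we do today to
keep Spark happy, at the cost of the column being a bare int64.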
Thanks,
Kirill

[1]: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161

-----Original Message-----
From: Ryan Blue [mailto:[email protected]]
Sent: Monday, March 14, 2016 7:44 PM
To: Parquet Dev
Subject: Re: achieving better compression with Parquet

Kirill,

For 1, the reported size is just the data size. That doesn't include page
headers, statistics, or dictionary pages. You can see the size of the
dictionary pages in the dump output, which I would expect is where the
majority of the difference is.
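If you'd rather pull those numbers programmatically than eyeball the tool
output, a rough sketch against the parquet-mr 1.8 metadata API would look
something like this (untested; it only reads the footer, not the data pages):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    public class ColumnSizes {
      public static void main(String[] args) throws Exception {
        ParquetMetadata footer =
            ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));
        for (BlockMetaData block : footer.getBlocks()) {
          System.out.println("row group: " + block.getRowCount() + " rows, "
              + block.getTotalByteSize() + " uncompressed bytes");
          for (ColumnChunkMetaData column : block.getColumns()) {
            // total compressed/uncompressed bytes for the whole column chunk
            System.out.println("  " + column.getPath()
                + " codec=" + column.getCodec()
                + " encodings=" + column.getEncodings()
                + " compressed=" + column.getTotalSize()
                + " uncompressed=" + column.getTotalUncompressedSize());
          }
        }
      }
    }

Those totals come from the same footer metadata the meta command prints, so
they should line up with what you are already seeing.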
For 2, PLAIN/gzip is the best option for timestamps right now. The format 2.0
encodings include a delta-integer encoding that we expect to work really well
for timestamps, but that hasn't been committed for int64 yet.

Also, it should be safe to store timestamps as int64 using the
TIMESTAMP_MILLIS annotation. That's just a way to keep track of what the
values you write represent. When there isn't specific support for it, you
should just get an int64. Using that annotation should give you the exact
same behavior as not using it right now, but when you update to a version of
Spark that supports it you should be able to get timestamps out of your
existing data.

rb

On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <[email protected]> wrote:

> Thanks for the hint, Ryan!
>
> I applied the tool to the file and I've got some more questions, if you
> don't mind :-)
>
> 1) We're using a 64 MB page (row group) size, so I would expect the sum of
> all the values in the "compressed size" field (the {x} in the SZ:{x}/{y}/{z}
> notation) to be around 64 MB, but it's closer to 48 MB. Is this expected?
> 2) One of the largest fields is a Unix timestamp (we may have lots of
> timestamps for a single data record), which is written as a plain int64
> (we refrained from using OriginalType.TIMESTAMP_MILLIS as it does not seem
> to be supported by Spark yet). The tool says that this column is stored
> with "ENC:PLAIN" encoding (which I suppose is gzipped afterwards). Is this
> the most compact way to store timestamps, or would e.g. giving an
> "OriginalType.TIMESTAMP_MILLIS" or other hint make an improvement?
>
> Thanks,
> Kirill
>
>
> On 07 Mar 2016, at 00:26, Ryan Blue <[email protected]> wrote:
>
> > Hi Kirill,
> >
> > It's hard to say what the expected compression rate should be since
> > that's heavily data-dependent. Sounds like Parquet isn't doing too bad,
> > though.
> >
> > For inspecting the files, check out parquet-tools [1]. That can dump the
> > metadata from a file all the way down to the page level. The "meta"
> > command will print out each row group and column within those row groups,
> > which should give you the info you're looking for.
> >
> > rb
> >
> > [1]: http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar
> >
> > On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <[email protected]> wrote:
> >
> >> Hi guys,
> >>
> >> We're evaluating Parquet as the high-compression format for our logs.
> >> We took some ~850 GB of TSV data (some columns are JSON) and Parquet
> >> (CompressionCodec.GZIP) gave us 6.8x compression, whereas plain GZip
> >> (with Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the
> >> same data.
> >>
> >> So the questions are:
> >>
> >> 1) Is this roughly the expected compression rate (compared to GZip)?
> >> 2) As we specially crafted the Parquet schema with maps and lists for
> >> certain fields, is there any tool to show the sizes of individual
> >> Parquet columns so we can find the biggest ones?
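> >>
> >> (For completeness, the Parquet side of that comparison is set up roughly
> >> as below -- parquet-mr 1.8 API, with a placeholder schema and sizes; the
> >> real schema is the one with maps and lists mentioned in question 2.)
> >>
> >>     import org.apache.hadoop.conf.Configuration;
> >>     import org.apache.hadoop.fs.Path;
> >>     import org.apache.parquet.column.ParquetProperties;
> >>     import org.apache.parquet.example.data.Group;
> >>     import org.apache.parquet.example.data.simple.SimpleGroupFactory;
> >>     import org.apache.parquet.hadoop.ParquetWriter;
> >>     import org.apache.parquet.hadoop.example.GroupWriteSupport;
> >>     import org.apache.parquet.hadoop.metadata.CompressionCodecName;
> >>     import org.apache.parquet.schema.MessageType;
> >>     import org.apache.parquet.schema.MessageTypeParser;
> >>
> >>     public class GzipParquetWrite {
> >>       public static void main(String[] args) throws Exception {
> >>         MessageType schema = MessageTypeParser.parseMessageType(
> >>             "message log_record { required int64 event_ts; required binary payload; }");
> >>         Configuration conf = new Configuration();
> >>         GroupWriteSupport.setSchema(schema, conf);
> >>
> >>         ParquetWriter<Group> writer = new ParquetWriter<Group>(
> >>             new Path(args[0]),
> >>             new GroupWriteSupport(),
> >>             CompressionCodecName.GZIP,   // the codec behind the 6.8x number
> >>             64 * 1024 * 1024,            // 64 MB row group (block) size
> >>             1024 * 1024,                 // page size
> >>             1024 * 1024,                 // dictionary page size
> >>             true,                        // enable dictionary encoding
> >>             false,                       // no write-time validation
> >>             ParquetProperties.WriterVersion.PARQUET_1_0,
> >>             conf);
> >>
> >>         SimpleGroupFactory groups = new SimpleGroupFactory(schema);
> >>         writer.write(groups.newGroup()
> >>             .append("event_ts", System.currentTimeMillis())
> >>             .append("payload", "example"));
> >>         writer.close();
> >>       }
> >>     }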
> >>
> >> Thanks in advance,
> >> Kirill
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix

--
Ryan Blue
Software Engineer
Netflix