Adding the delta encoding for int64 should be a matter of weeks rather than months. We should also open a bug report for that line in Spark: it should not fail when an annotation is unsupported, it should just ignore it.
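
For reference, the annotation is just part of the schema declaration, so using it changes nothing about how the values are stored today. A minimal sketch with parquet-mr (the message and field names here are only for illustration):

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class TimestampSchemaExample {
      public static void main(String[] args) {
        // Hypothetical schema: "event_time" is stored as a plain int64;
        // the TIMESTAMP_MILLIS annotation only records what the values mean,
        // so readers without specific support still see an int64 column.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message event {\n"
          + "  required int64 event_time (TIMESTAMP_MILLIS);\n"
          + "  required binary payload (UTF8);\n"
          + "}");
        System.out.println(schema);
      }
    }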
On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <[email protected]> wrote:

> Thanks for the reply Ryan,
>
> > For 2, PLAIN/gzip is the best option for timestamps right now. The format
> > 2.0 encodings include a delta-integer encoding that we expect to work
> > really well for timestamps, but that hasn't been committed for int64 yet.
>
> Is there any ETA on when it can appear? Just the order, e.g. weeks or
> months?
>
> > Also, it should be safe to store timestamps as int64 using the
> > TIMESTAMP_MILLIS annotation.
>
> Unfortunately this is not the case for us, as Parquet complains with
> "Parquet type not yet supported" [1].
>
> Thanks,
> Kirill
>
> [1]:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
>
> -----Original Message-----
> From: Ryan Blue [mailto:[email protected]]
> Sent: Monday, March 14, 2016 7:44 PM
> To: Parquet Dev
> Subject: Re: achieving better compression with Parquet
>
> Kirill,
>
> For 1, the reported size is just the data size. That doesn't include page
> headers, statistics, or dictionary pages. You can see the size of the
> dictionary pages in the dump output, which is where I would expect most of
> the difference to be.
>
> For 2, PLAIN/gzip is the best option for timestamps right now. The format
> 2.0 encodings include a delta-integer encoding that we expect to work
> really well for timestamps, but that hasn't been committed for int64 yet.
>
> Also, it should be safe to store timestamps as int64 using the
> TIMESTAMP_MILLIS annotation. That's just a way to keep track of what the
> values you write represent. When there isn't specific support for it, you
> should just get an int64. Using that annotation should give you the exact
> same behavior as not using it right now, but when you update to a version
> of Spark that supports it, you should be able to get timestamps out of
> your existing data.
>
> rb
>
> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <[email protected]>
> wrote:
>
> > Thanks for the hint Ryan!
> >
> > I applied the tool to the file and I've got some more questions, if you
> > don't mind :-)
> >
> > 1) We're using a 64 MB page (row group) size, so I would expect the sum
> > of all the values in the "compressed size" field (which is {x} in the
> > SZ:{x}/{y}/{z} notation) to be around 64 MB, but it's closer to 48 MB.
> > Is this expected?
> > 2) One of the largest fields is a Unix timestamp (we may have lots of
> > timestamps for a single data record), which is written as plain int64
> > (we refrained from using OriginalType.TIMESTAMP_MILLIS as it seems not
> > to be supported by Spark yet). The tool says that this column is stored
> > with "ENC:PLAIN" encoding (which I suppose is gzipped afterwards). Is
> > this the most compact way to store timestamps, or will e.g. giving an
> > "OriginalType.TIMESTAMP_MILLIS" or other hint make an improvement?
> >
> > Thanks,
> > Kirill
> >
> > > On 07 Mar 2016, at 00:26, Ryan Blue <[email protected]> wrote:
> > >
> > > Hi Kirill,
> > >
> > > It's hard to say what the expected compression rate should be, since
> > > that's heavily data-dependent. Sounds like Parquet isn't doing too
> > > badly, though.
> > >
> > > For inspecting the files, check out parquet-tools [1]. That can dump
> > > the metadata from a file all the way down to the page level. The "meta"
> > > command will print out each row group and column within those row
> > > groups, which should give you the info you're looking for.
> > >
> > > rb
> > >
> > > [1]:
> > > http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar
> > >
> > > On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <[email protected]>
> > > wrote:
> > >
> > >> Hi guys,
> > >>
> > >> We're evaluating Parquet as the high-compression format for our logs.
> > >> We took ~850 GB of TSV data (some columns are JSON), and Parquet
> > >> (CompressionCodec.GZIP) gave us 6.8x compression, whereas plain gzip
> > >> (with Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the
> > >> same data.
> > >>
> > >> So the questions are:
> > >>
> > >> 1) Is this a somewhat expected compression rate (compared to gzip)?
> > >> 2) As we specially crafted the Parquet schema with maps and lists for
> > >> certain fields, is there any tool to show the sizes of individual
> > >> Parquet columns so we can find the biggest ones?
> > >>
> > >> Thanks in advance,
> > >> Kirill
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
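
As a rough sketch of the per-column breakdown discussed above (assuming parquet-mr 1.8.x; the input path is a placeholder), the same row-group and column information that the parquet-tools "meta" command prints can also be read programmatically from the file footer:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    public class ColumnSizes {
      public static void main(String[] args) throws Exception {
        // Placeholder path: point this at one of your gzip-compressed Parquet files.
        Path file = new Path(args[0]);
        ParquetMetadata footer = ParquetFileReader.readFooter(new Configuration(), file);

        for (BlockMetaData rowGroup : footer.getBlocks()) {
          System.out.println("row group, uncompressed bytes: " + rowGroup.getTotalByteSize());
          for (ColumnChunkMetaData column : rowGroup.getColumns()) {
            // getTotalSize() is the compressed on-disk size of the column chunk;
            // getTotalUncompressedSize() is the size before the codec is applied.
            System.out.println("  " + column.getPath()
                + " codec=" + column.getCodec()
                + " encodings=" + column.getEncodings()
                + " compressed=" + column.getTotalSize()
                + " uncompressed=" + column.getTotalUncompressedSize());
          }
        }
      }
    }

Sorting the column chunks by getTotalSize() across row groups is one way to surface the biggest columns quickly.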
