Kirill, I would think that if such a capability is introduced it should be `optional`, since depending on your query patterns it might make more sense to sort on another column.
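For illustration only, here is a minimal sketch of how files could be written time-sorted from the application side today, so that each row group's min/max statistics cover a narrow time range and a time-range query has a chance to skip the rest. This assumes a recent Spark DataFrame API; the column name and paths are hypothetical and not from the thread:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: sort events chronologically before writing so each row group's
// min/max statistics span a narrow time interval. Paths and the "event_time"
// column name are made up for illustration.
val spark = SparkSession.builder().appName("sorted-parquet-write").getOrCreate()

val events = spark.read.parquet("/data/events")     // hypothetical unsorted input

events
  .sort("event_time")                               // hypothetical timestamp column
  .write
  .option("compression", "gzip")                    // same codec discussed in the thread
  .parquet("/data/events_sorted")                   // hypothetical time-sorted output
```

Whether irrelevant row groups are actually skipped still depends on the reader honoring the statistics, so treat this as a sketch of the general technique rather than a guarantee.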
On Tue, Mar 15, 2016 at 10:18 PM, Kirill Safonov <[email protected]> wrote:

> Thanks Ryan,
>
> One more question please: as we're going to store timestamped events in
> Parquet, would it be beneficial to write the files chronologically sorted?
> Namely, will a query for a certain time range over a time-sorted Parquet
> file be optimised so that the irrelevant portion of the data is skipped
> and no "full scan" is done?
>
> Kirill
>
> > On 14 Mar 2016, at 22:00, Ryan Blue <[email protected]> wrote:
> >
> > Adding int64-delta should be weeks. We should also open a bug report for
> > that line in Spark. It should not fail if an annotation is unsupported.
> > It should ignore it.
> >
> > On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <[email protected]> wrote:
> >
> >> Thanks for the reply, Ryan.
> >>
> >>> For 2, PLAIN/gzip is the best option for timestamps right now. The
> >>> format 2.0 encodings include a delta-integer encoding that we expect
> >>> to work really well for timestamps, but that hasn't been committed
> >>> for int64 yet.
> >>
> >> Is there any ETA on when it might appear? Just the order of magnitude,
> >> e.g. weeks or months?
> >>
> >>> Also, it should be safe to store timestamps as int64 using the
> >>> TIMESTAMP_MILLIS annotation.
> >>
> >> Unfortunately this is not the case for us, as Parquet complains with
> >> "Parquet type not yet supported" [1].
> >>
> >> Thanks,
> >> Kirill
> >>
> >> [1]:
> >> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
> >>
> >> -----Original Message-----
> >> From: Ryan Blue [mailto:[email protected]]
> >> Sent: Monday, March 14, 2016 7:44 PM
> >> To: Parquet Dev
> >> Subject: Re: achieving better compression with Parquet
> >>
> >> Kirill,
> >>
> >> For 1, the reported size is just the data size. That doesn't include
> >> page headers, statistics, or dictionary pages. You can see the size of
> >> the dictionary pages in the dump output, which I would expect is where
> >> the majority of the difference is.
> >>
> >> For 2, PLAIN/gzip is the best option for timestamps right now. The
> >> format 2.0 encodings include a delta-integer encoding that we expect
> >> to work really well for timestamps, but that hasn't been committed for
> >> int64 yet.
> >>
> >> Also, it should be safe to store timestamps as int64 using the
> >> TIMESTAMP_MILLIS annotation. That's just a way to keep track of what
> >> the values you write represent. When there isn't specific support for
> >> it, you should just get an int64. Using that annotation should give
> >> you the exact same behavior as not using it right now, but when you
> >> update to a version of Spark that supports it you should be able to
> >> get timestamps out of your existing data.
> >>
> >> rb
> >>
> >> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <[email protected]> wrote:
> >>
> >>> Thanks for the hint, Ryan!
> >>>
> >>> I applied the tool to the file and I've got some more questions, if
> >>> you don't mind :-)
> >>>
> >>> 1) We're using a 64 MB page (row group) size, so I would expect the
> >>> sum of all the values in the "compressed size" field (which is {x} in
> >>> the SZ:{x}/{y}/{z} notation) to be around 64 MB, but it's nearer 48 MB.
> >>> Is this expected?
> >>> 2) One of the largest fields is a Unix timestamp (we may have lots of
> >>> timestamps for a single data record), which is written as a plain
> >>> int64 (we refrained from using OriginalType.TIMESTAMP_MILLIS as it
> >>> seems not to be supported by Spark yet). The tool says that this
> >>> column is stored with "ENC:PLAIN" encoding (which I suppose is
> >>> gzipped afterwards). Is this the most compact way to store
> >>> timestamps, or would e.g. an OriginalType.TIMESTAMP_MILLIS or other
> >>> hint make an improvement?
> >>>
> >>> Thanks,
> >>> Kirill
> >>>
> >>>> On 07 Mar 2016, at 00:26, Ryan Blue <[email protected]> wrote:
> >>>>
> >>>> Hi Kirill,
> >>>>
> >>>> It's hard to say what the expected compression rate should be, since
> >>>> that's heavily data-dependent. Sounds like Parquet isn't doing too
> >>>> badly, though.
> >>>>
> >>>> For inspecting the files, check out parquet-tools [1]. That can dump
> >>>> the metadata from a file all the way down to the page level. The
> >>>> "meta" command will print out each row group and column within those
> >>>> row groups, which should give you the info you're looking for.
> >>>>
> >>>> rb
> >>>>
> >>>> [1]:
> >>>> http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar
> >>>>
> >>>> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <[email protected]> wrote:
> >>>>
> >>>>> Hi guys,
> >>>>>
> >>>>> We're evaluating Parquet as the high-compression format for our
> >>>>> logs. We took some ~850 GB of TSV data (some columns are JSON) and
> >>>>> Parquet (CompressionCodec.GZIP) gave us 6.8x compression, whereas
> >>>>> plain GZip (with Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times
> >>>>> less) on the same data.
> >>>>>
> >>>>> So the questions are:
> >>>>>
> >>>>> 1) Is this a somewhat expected compression rate (compared to GZip)?
> >>>>> 2) As we specially crafted the Parquet schema with maps and lists
> >>>>> for certain fields, is there any tool to show the sizes of
> >>>>> individual Parquet columns so we can find the biggest ones?
> >>>>>
> >>>>> Thanks in advance,
> >>>>> Kirill
> >>>>
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>> Software Engineer
> >>>> Netflix
> >>>
> >>>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
> >>
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
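As a side note on the TIMESTAMP_MILLIS point discussed above, here is a minimal sketch (not from the thread; field names are hypothetical) of how that annotation is declared when building a parquet-mr schema directly. As Ryan notes, a reader without specific support for the annotation still just sees the underlying int64:

```scala
import org.apache.parquet.schema.{MessageType, MessageTypeParser}

// Sketch only: an int64 column annotated as TIMESTAMP_MILLIS in a parquet-mr
// message type. The "event" message and its field names are hypothetical.
val schema: MessageType = MessageTypeParser.parseMessageType(
  """message event {
    |  required binary payload (UTF8);
    |  required int64 event_time (TIMESTAMP_MILLIS);
    |}""".stripMargin)
```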
