Kirill,

Yes, sorting data by the columns you intend to filter by will definitely
help query performance: we keep min/max stats for each column chunk and
page, and those stats are used to eliminate row groups when you pass
filters into Parquet.
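In case a concrete illustration helps, here is a minimal sketch of the
write-sorted / filter-on-read pattern. It uses the Spark 2.x DataFrame API
for brevity (newer than the Spark version discussed in this thread), and
the column names, paths and timestamp values are made up:

    import org.apache.spark.sql.SparkSession

    object SortedWriteSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("sorted-write-sketch")
          .getOrCreate()
        import spark.implicits._

        // Hypothetical source: events with an epoch-millis "timestamp" column.
        val events = spark.read.json("hdfs:///logs/events/2016-03-16/*.json")

        // Sort on the column you will filter by, so each row group's min/max
        // statistics cover a narrow range of values.
        events.sortWithinPartitions($"timestamp")
          .write
          .option("compression", "gzip")
          .parquet("hdfs:///warehouse/events_parquet")

        // The range predicate is pushed down to Parquet; row groups whose
        // [min, max] interval does not overlap 2016-03-16..2016-03-17
        // are skipped entirely.
        val count = spark.read.parquet("hdfs:///warehouse/events_parquet")
          .filter($"timestamp" >= 1458086400000L && $"timestamp" < 1458172800000L)
          .count()
        println(count)
      }
    }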
rb

On Wed, Mar 16, 2016 at 1:07 AM, Kirill Safonov <[email protected]> wrote:

> Antwins,
>
> A typical query for us is something like ‘Select events where [here come
> attribute constraints] and timestamp > 2016-03-16 and timestamp <
> 2016-03-17’, which is why I’m asking whether this query can benefit from
> timestamp ordering.
>
> > On 16 Mar 2016, at 03:03, Antwnis <[email protected]> wrote:
> >
> > Kirill,
> >
> > I would think that if such a capability is introduced it should be
> > `optional`, as depending on your query patterns it might make more sense
> > to sort on another column.
> >
> > On Tue, Mar 15, 2016 at 10:18 PM, Kirill Safonov <[email protected]> wrote:
> >
> >> Thanks Ryan,
> >>
> >> One more question, please: as we’re going to store timestamped events in
> >> Parquet, would it be beneficial to write the files chronologically sorted?
> >> Namely, will a query for a certain time range over a time-sorted Parquet
> >> file be optimised so that the irrelevant portion of the data is skipped
> >> and no "full scan" is done?
> >>
> >> Kirill
> >>
> >>> On 14 Mar 2016, at 22:00, Ryan Blue <[email protected]> wrote:
> >>>
> >>> Adding int64-delta should be weeks. We should also open a bug report for
> >>> that line in Spark. It should not fail if an annotation is unsupported;
> >>> it should ignore it.
> >>>
> >>> On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <[email protected]> wrote:
> >>>
> >>>> Thanks for the reply, Ryan.
> >>>>
> >>>>> For 2, PLAIN/gzip is the best option for timestamps right now. The
> >>>>> format 2.0 encodings include a delta-integer encoding that we expect
> >>>>> to work really well for timestamps, but that hasn't been committed
> >>>>> for int64 yet.
> >>>>
> >>>> Is there any ETA on when it might appear? Just the order of magnitude,
> >>>> e.g. weeks or months?
> >>>>
> >>>>> Also, it should be safe to store timestamps as int64 using the
> >>>>> TIMESTAMP_MILLIS annotation.
> >>>>
> >>>> Unfortunately this is not the case for us, as Parquet complains with
> >>>> "Parquet type not yet supported" [1].
> >>>>
> >>>> Thanks,
> >>>> Kirill
> >>>>
> >>>> [1]:
> >>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
> >>>>
> >>>> -----Original Message-----
> >>>> From: Ryan Blue [mailto:[email protected]]
> >>>> Sent: Monday, March 14, 2016 7:44 PM
> >>>> To: Parquet Dev
> >>>> Subject: Re: achieving better compression with Parquet
> >>>>
> >>>> Kirill,
> >>>>
> >>>> For 1, the reported size is just the data size. That doesn't include
> >>>> page headers, statistics, or dictionary pages. You can see the size of
> >>>> the dictionary pages in the dump output, which I would expect is where
> >>>> the majority of the difference is.
> >>>>
> >>>> For 2, PLAIN/gzip is the best option for timestamps right now. The
> >>>> format 2.0 encodings include a delta-integer encoding that we expect to
> >>>> work really well for timestamps, but that hasn't been committed for
> >>>> int64 yet.
> >>>>
> >>>> Also, it should be safe to store timestamps as int64 using the
> >>>> TIMESTAMP_MILLIS annotation. That's just a way to keep track of what
> >>>> the values you write represent. When there isn't specific support for
> >>>> it, you should just get an int64. Using that annotation should give you
> >>>> the exact same behavior as not using it right now, but when you update
> >>>> to a version of Spark that supports it you should be able to get
> >>>> timestamps out of your existing data.
> >>>>
> >>>> rb
> >>>>
> >>>> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <[email protected]> wrote:
> >>>>
> >>>>> Thanks for the hint, Ryan!
> >>>>>
> >>>>> I applied the tool to the file and I’ve got some more questions, if
> >>>>> you don’t mind :-)
> >>>>>
> >>>>> 1) We’re using a 64 MB page (row group) size, so I would expect the
> >>>>> sum of all the values in the “compressed size” field (which is {x} in
> >>>>> the SZ:{x}/{y}/{z} notation) to be around 64 MB, but it’s near 48 MB.
> >>>>> Is this expected?
> >>>>> 2) One of the largest fields is a Unix timestamp (we may have lots of
> >>>>> timestamps for a single data record), which is written as a plain
> >>>>> int64 (we refrained from using OriginalType.TIMESTAMP_MILLIS as it
> >>>>> seems to be not yet supported by Spark). The tool says that this
> >>>>> column is stored with “ENC:PLAIN” encoding (which I suppose is
> >>>>> gzipped afterwards). Is this the most compact way to store
> >>>>> timestamps, or would e.g. adding an OriginalType.TIMESTAMP_MILLIS or
> >>>>> other hint make an improvement?
> >>>>>
> >>>>> Thanks,
> >>>>> Kirill
> >>>>>
> >>>>>> On 07 Mar 2016, at 00:26, Ryan Blue <[email protected]> wrote:
> >>>>>>
> >>>>>> Hi Kirill,
> >>>>>>
> >>>>>> It's hard to say what the expected compression rate should be since
> >>>>>> that's heavily data-dependent. Sounds like Parquet isn't doing too
> >>>>>> bad, though.
> >>>>>>
> >>>>>> For inspecting the files, check out parquet-tools [1]. That can dump
> >>>>>> the metadata from a file all the way down to the page level. The
> >>>>>> "meta" command will print out each row group and the columns within
> >>>>>> those row groups, which should give you the info you're looking for.
> >>>>>>
> >>>>>> rb
> >>>>>>
> >>>>>> [1]:
> >>>>>> http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar
> >>>>>>
> >>>>>> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi guys,
> >>>>>>>
> >>>>>>> We’re evaluating Parquet as a high-compression format for our logs.
> >>>>>>> We took some ~850 GB of TSV data (some columns are JSON), and
> >>>>>>> Parquet (CompressionCodec.GZIP) gave us 6.8x compression, whereas
> >>>>>>> plain gzip (with Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times
> >>>>>>> less) on the same data.
> >>>>>>>
> >>>>>>> So the questions are:
> >>>>>>>
> >>>>>>> 1) Is this a somewhat expected compression rate (compared to gzip)?
> >>>>>>> 2) As we specially crafted the Parquet schema with maps and lists
> >>>>>>> for certain fields, is there any tool to show the sizes of
> >>>>>>> individual Parquet columns so we can find the biggest ones?
> >>>>>>>
> >>>>>>> Thanks in advance,
> >>>>>>> Kirill
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Ryan Blue
> >>>>>> Software Engineer
> >>>>>> Netflix
> >>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>> Software Engineer
> >>>> Netflix
> >>>>
> >>>
> >>>
> >>> --
> >>> Ryan Blue
> >>> Software Engineer
> >>> Netflix
> >>
> >>
>
--
Ryan Blue
Software Engineer
Netflix
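For reference, a minimal parquet-mr sketch (1.8.x-era API) of the schema
declaration discussed above: an int64 timestamp column carrying the
TIMESTAMP_MILLIS annotation. The message and field names are made up, and
the ParquetWriter wiring is left out; as noted earlier in the thread, the
Spark version discussed here rejected the annotation, so check your reader
before adopting it:

    import org.apache.parquet.schema.{MessageType, OriginalType, Types}
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.{BINARY, INT64}

    object TimestampSchemaSketch {
      // Equivalent schema string:
      //   message event {
      //     required int64 ts (TIMESTAMP_MILLIS);
      //     optional binary payload (UTF8);
      //   }
      val eventSchema: MessageType = Types.buildMessage()
        .required(INT64).as(OriginalType.TIMESTAMP_MILLIS).named("ts")
        .optional(BINARY).as(OriginalType.UTF8).named("payload")
        .named("event")

      def main(args: Array[String]): Unit = {
        // Prints the schema back in the message-type string syntax above.
        println(eventSchema)
      }
    }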
