Antwnis,

A typical query for us is something like ‘SELECT events WHERE [attribute constraints here] AND timestamp > 2016-03-16 AND timestamp < 2016-03-17’, which is why I’m asking whether such a query can benefit from timestamp ordering.
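For concreteness, here is a rough sketch of the write/read pattern we have in mind (assuming an existing `events` DataFrame and `sqlContext`; the column and path names are made up for illustration):

import org.apache.spark.sql.functions.col

// Write: sorting by the event timestamp before writing means each row group
// covers a narrow, mostly non-overlapping time range, so its min/max column
// statistics could be used to skip it when the range filter doesn't match.
events
  .sort("timestamp")
  .write
  .parquet("/data/events.parquet")

// Read: the time-range query we'd like to benefit from that ordering
// (timestamps stored as epoch millis in a plain int64 column).
val from = 1458086400000L // 2016-03-16 00:00 UTC
val to   = 1458172800000L // 2016-03-17 00:00 UTC
sqlContext.read.parquet("/data/events.parquet")
  .where(col("timestamp") > from && col("timestamp") < to)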
> On 16 Mar 2016, at 03:03, Antwnis <[email protected]> wrote:
>
> Kirill,
>
> I would think that if such a capability is introduced it should be
> `optional`, as depending on your query patterns it might make more sense
> to sort on another column.
>
> On Tue, Mar 15, 2016 at 10:18 PM, Kirill Safonov <[email protected]> wrote:
>
>> Thanks Ryan,
>>
>> One more question please: as we’re going to store timestamped events in
>> Parquet, would it be beneficial to write the files chronologically sorted?
>> Namely, will a query for a certain time range over a time-sorted Parquet
>> file be optimised so that the irrelevant portion of the data is skipped
>> and no "full scan" is done?
>>
>> Kirill
>>
>>> On 14 Mar 2016, at 22:00, Ryan Blue <[email protected]> wrote:
>>>
>>> Adding int64-delta should be weeks. We should also open a bug report for
>>> that line in Spark. It should not fail if an annotation is unsupported;
>>> it should ignore it.
>>>
>>> On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <[email protected]> wrote:
>>>
>>>> Thanks for the reply, Ryan.
>>>>
>>>>> For 2, PLAIN/gzip is the best option for timestamps right now. The
>>>>> format 2.0 encodings include a delta-integer encoding that we expect
>>>>> to work really well for timestamps, but that hasn't been committed
>>>>> for int64 yet.
>>>>
>>>> Is there any ETA on when it might appear? Just the order of magnitude,
>>>> e.g. weeks or months?
>>>>
>>>>> Also, it should be safe to store timestamps as int64 using the
>>>>> TIMESTAMP_MILLIS annotation.
>>>>
>>>> Unfortunately this is not the case for us, as Parquet complains with
>>>> "Parquet type not yet supported" [1].
>>>>
>>>> Thanks,
>>>> Kirill
>>>>
>>>> [1]:
>>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
>>>>
>>>> -----Original Message-----
>>>> From: Ryan Blue [mailto:[email protected]]
>>>> Sent: Monday, March 14, 2016 7:44 PM
>>>> To: Parquet Dev
>>>> Subject: Re: achieving better compression with Parquet
>>>>
>>>> Kirill,
>>>>
>>>> For 1, the reported size is just the data size. That doesn't include
>>>> page headers, statistics, or dictionary pages. You can see the size of
>>>> the dictionary pages in the dump output, which I would expect is where
>>>> the majority of the difference is.
>>>>
>>>> For 2, PLAIN/gzip is the best option for timestamps right now. The
>>>> format 2.0 encodings include a delta-integer encoding that we expect
>>>> to work really well for timestamps, but that hasn't been committed for
>>>> int64 yet.
>>>>
>>>> Also, it should be safe to store timestamps as int64 using the
>>>> TIMESTAMP_MILLIS annotation. That's just a way to keep track of what
>>>> the values you write represent. When there isn't specific support for
>>>> it, you should just get an int64. Using that annotation should give
>>>> you the exact same behavior as not using it right now, but when you
>>>> update to a version of Spark that supports it you should be able to
>>>> get timestamps out of your existing data.
>>>>
>>>> rb
>>>>
>>>> On Mon, Mar 7, 2016 at 3:29 PM, Kirill Safonov <[email protected]> wrote:
>>>>
>>>>> Thanks for the hint Ryan!
>>>>>
>>>>> I applied the tool to the file and I’ve got some more questions, if
>>>>> you don’t mind :-)
>>>>>
>>>>> 1) We’re using a 64 MB page (row group) size, so I would expect the
>>>>> sum of all the values in the "compressed size" field (the {x} in the
>>>>> SZ:{x}/{y}/{z} notation) to be around 64 MB, but it’s near 48 MB. Is
>>>>> this expected?
>>>>> 2) One of the largest fields is the Unix timestamp (we may have lots
>>>>> of timestamps for a single data record), which is written as plain
>>>>> int64 (we refrained from using OriginalType.TIMESTAMP_MILLIS as it
>>>>> seems to be not yet supported by Spark). The tool says that this
>>>>> column is stored with "ENC:PLAIN" encoding (which I suppose is
>>>>> GZipped afterwards). Is this the most compact way to store
>>>>> timestamps, or would giving an "OriginalType.TIMESTAMP_MILLIS" or
>>>>> other hint make an improvement?
>>>>>
>>>>> Thanks,
>>>>> Kirill
>>>>>
>>>>>> On 07 Mar 2016, at 00:26, Ryan Blue <[email protected]> wrote:
>>>>>>
>>>>>> Hi Kirill,
>>>>>>
>>>>>> It's hard to say what the expected compression rate should be, since
>>>>>> that's heavily data-dependent. Sounds like Parquet isn't doing too
>>>>>> bad, though.
>>>>>>
>>>>>> For inspecting the files, check out parquet-tools [1]. That can dump
>>>>>> the metadata from a file all the way down to the page level. The
>>>>>> "meta" command will print out each row group and column within those
>>>>>> row groups, which should give you the info you're looking for.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> [1]:
>>>>>> http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparquet-tools%7C1.8.1%7Cjar
>>>>>>
>>>>>> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov <[email protected]> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> We’re evaluating Parquet as the high-compression format for our
>>>>>>> logs. We took some ~850 GB of TSV data (some columns are JSON) and
>>>>>>> Parquet (CompressionCodec.GZIP) gave us 6.8x compression, whereas
>>>>>>> plain GZip (with Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times
>>>>>>> less) on the same data.
>>>>>>>
>>>>>>> So the questions are:
>>>>>>>
>>>>>>> 1) Is this a somewhat expected compression ratio (compared to GZip)?
>>>>>>> 2) As we specially crafted the Parquet schema with maps and lists
>>>>>>> for certain fields, is there any tool to show the sizes of
>>>>>>> individual Parquet columns so we can find the biggest ones?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Kirill
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
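P.S. In case it’s useful, a minimal sketch of how we would declare that timestamp column with the annotation via the parquet-mr Types builder, following Ryan’s point above that the annotation only documents what the int64 values represent (the field names are illustrative):

import org.apache.parquet.schema.{MessageType, Types}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.{BINARY, INT64}
import org.apache.parquet.schema.OriginalType.{TIMESTAMP_MILLIS, UTF8}

// The annotation only records what the int64 holds; a reader without
// TIMESTAMP_MILLIS support should still see a plain int64 column.
val eventSchema: MessageType = Types.buildMessage()
  .required(BINARY).as(UTF8).named("event_type")
  .required(INT64).as(TIMESTAMP_MILLIS).named("event_time")
  .named("event")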
