Right now, the spec supports storing columns in separate files, but I don't think the implementation does. It wouldn't be too hard to make that work, but it isn't supported today.
For predicate push-down in Spark, I've gotten it working and will be getting the patches into upstream Spark. It mostly works now with a few settings, but string/binary stats filtering is disabled because of PARQUET-251. I am also trying to get a few important patches in to help when writing Parquet files and to avoid OOMs.
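For anyone who wants to try it, a rough sketch of the knobs involved (Spark 1.6-era API, the spark.sql.parquet.filterPushdown setting plus sorted writes; the paths and the "ts" column are made up for illustration, not a definitive recipe):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("pushdown-sketch"))
val sqlContext = new SQLContext(sc)

// Enable Parquet filter push-down. String/binary stats filtering stays off
// because of PARQUET-251; numeric min/max filtering still applies.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

// Writing data sorted by the column you filter on keeps the min/max range of
// each row group narrow, so more row groups can be skipped at read time.
// The input path/format below is made up.
val events = sqlContext.read.json("hdfs:///logs/raw")
events.sort("ts").write.parquet("hdfs:///logs/events")

// A range predicate on the sorted column can then be checked against row-group
// statistics instead of scanning every row group (2016-03-16 to 2016-03-17 in ms).
val oneDay = sqlContext.read.parquet("hdfs:///logs/events")
  .filter("ts >= 1458086400000 AND ts < 1458172800000")
oneDay.count()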
rb

On Sat, May 7, 2016 at 2:39 PM, Kirill Safonov <[email protected]> wrote:
> Hi Ryan, guys,
>
> Let me please follow up on your last answer. Parquet file can be physically
> stored as a single file (written via WriteSupport) or as a folder with a
> collection of "parallel" files (generated by map-reduce or Spark
> via ParquetOutputFormat).
>
> Will a Spark task processing Parquet input benefit equally from min/max
> stats for both cases (single file vs folder)?
>
> Thanks,
> Kirill
>
> On Wed, Mar 16, 2016 at 8:30 PM, Ryan Blue <[email protected]> wrote:
> >
> > Kirill,
> >
> > Yes, sorting data by the columns you intend to filter by will definitely
> > help query performance because we keep min/max stats for each column chunk
> > and page that are used to eliminate row groups when you are passing filters
> > into Parquet.
> >
> > rb
> >
> > On Wed, Mar 16, 2016 at 1:07 AM, Kirill Safonov <[email protected]> wrote:
> > >
> > > Antwins,
> > >
> > > Typical query for us is something like ‘Select events where [here come
> > > attributes constraints] and timestamp > 2016-03-16 and timestamp <
> > > 2016-03-17’, that’s why I’m asking if this query can benefit from
> > > timestamp ordering.
> > >
> > > > On 16 Mar 2016, at 03:03, Antwnis <[email protected]> wrote:
> > > >
> > > > Kirill,
> > > >
> > > > I would think that if such a capability is introduced it should be
> > > > `optional` as depending on your query patterns it might make more sense
> > > > to sort on another column.
> > > >
> > > > On Tue, Mar 15, 2016 at 10:18 PM, Kirill Safonov <[email protected]> wrote:
> > > >
> > > >> Thanks Ryan,
> > > >>
> > > >> One more question please: as we’re going to store timestamped events in
> > > >> Parquet, would it be beneficial to write the files chronologically sorted?
> > > >> Namely, will the query for the certain time range over the time-sorted
> > > >> Parquet file be optimised so that irrelevant portion of data is skipped
> > > >> and no "full scan" is done?
> > > >>
> > > >> Kirill
> > > >>
> > > >>> On 14 Mar 2016, at 22:00, Ryan Blue <[email protected]> wrote:
> > > >>>
> > > >>> Adding int64-delta should be weeks. We should also open a bug report for
> > > >>> that line in Spark. It should not fail if an annotation is unsupported.
> > > >>> It should ignore it.
> > > >>>
> > > >>> On Mon, Mar 14, 2016 at 10:11 AM, Kirill Safonov <[email protected]> wrote:
> > > >>>
> > > >>>> Thanks for reply Ryan,
> > > >>>>
> > > >>>>> For 2, PLAIN/gzip is the best option for timestamps right now. The format
> > > >>>>> 2.0 encodings include a delta-integer encoding that we expect to work
> > > >>>>> really well for timestamps, but that hasn't been committed for int64 yet.
> > > >>>>
> > > >>>> Is there any ETA on when it can appear? Just the order e.g. weeks or
> > > >>>> months?
> > > >>>>
> > > >>>>> Also, it should be safe to store timestamps as int64 using the
> > > >>>>> TIMESTAMP_MILLIS annotation.
> > > >>>>
> > > >>>> Unfortunately this is not the case for us as the Parquet complains with
> > > >>>> "Parquet type not yet supported" [1].
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Kirill
> > > >>>>
> > > >>>> [1]:
> > > >>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L161
> > > >>>>
> > > >>>> -----Original Message-----
> > > >>>> From: Ryan Blue [mailto:[email protected]]
> > > >>>> Sent: Monday, March 14, 2016 7:44 PM
> > > >>>> To: Parquet Dev
> > > >>>> Subject: Re: achieving better compression with Parquet
> > > >>>>
> > > >>>> Kirill,
> > > >>>>
> > > >>>> For 1, the reported size is just the data size. That doesn't include page
> > > >>>> headers, statistics, or dictionary pages. You can see the size of the
> > > >>>> dictionary pages in the dump output, which I would expect is where the
> > > >>>> majority of the difference is.
> > > >>>>
> > > >>>> For 2, PLAIN/gzip is the best option for timestamps right now. The format
> > > >>>> 2.0 encodings include a delta-integer encoding that we expect to work
> > > >>>> really well for timestamps, but that hasn't been committed for int64 yet.
> > > >>>>
> > > >>>> Also, it should be safe to store timestamps as int64 using the
> > > >>>> TIMESTAMP_MILLIS annotation. That's just a way to keep track of what the
> > > >>>> values you write represent. When there isn't specific support for it, you
> > > >>>> should just get an int64. Using that annotation should give you the exact
> > > >>>> same behavior as not using it right now, but when you update to a version
> > > >>>> of Spark that supports it you should be able to get timestamps out of your
> > > >>>> existing data.
> > > >>>>
> > > >>>> rb
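(To make the TIMESTAMP_MILLIS point above concrete, a minimal sketch against the parquet-mr 1.8.x Types builder; the "event"/"event_time" names are made up for illustration:)

import org.apache.parquet.schema.{MessageType, OriginalType, Types}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64

// Physically this is still a plain int64 column; the annotation only records
// that the values are epoch milliseconds, so a reader without TIMESTAMP_MILLIS
// support simply sees int64.
val schema: MessageType = Types.buildMessage()
  .required(INT64).as(OriginalType.TIMESTAMP_MILLIS).named("event_time")
  .named("event")

// Equivalent schema string, if you build schemas with MessageTypeParser:
//   message event { required int64 event_time (TIMESTAMP_MILLIS); }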
The > > > >> "meta" > > > >>>>> command > > > >>>>>> will print out each row group and column within those row > groups, > > > >>>>>> which should give you the info you're looking for. > > > >>>>>> > > > >>>>>> rb > > > >>>>>> > > > >>>>>> > > > >>>>>> [1]: > > > >>>>>> > > > >>>>> > > > http://search.maven.org/#artifactdetails%7Corg.apache.parquet%7Cparque > > > >>>>> t-tools%7C1.8.1%7Cjar > > > >>>>>> > > > >>>>>> On Sun, Mar 6, 2016 at 7:37 AM, Kirill Safonov > > > >>>>>> <[email protected] > > > >>>>>> > > > >>>>>> wrote: > > > >>>>>> > > > >>>>>>> Hi guys, > > > >>>>>>> > > > >>>>>>> We’re evaluating Parquet as the high compression format for our > > > >>>>>>> logs. We took some ~850Gb of TSV data (some columns are JSON) > and > > > >>>>>>> Parquet > > > >>>>>>> (CompressionCodec.GZIP) gave us 6.8x compression whereas plain > > GZip > > > >>>>> (with > > > >>>>>>> Deflater.BEST_COMPRESSION) gave 4.9x (~1.4 times less) on the > > same > > > >>>> data. > > > >>>>>>> > > > >>>>>>> So the questions are: > > > >>>>>>> > > > >>>>>>> 1) is this somewhat expected compression rate (compared to > GZip)? > > > >>>>>>> 2) As we specially crafted Parquet schema with maps and lists > for > > > >>>>> certain > > > >>>>>>> fields, is there any tool to show the sizes of individual > Parquet > > > >>>>> columns > > > >>>>>>> so we can find the biggest ones? > > > >>>>>>> > > > >>>>>>> Thanks in advance, > > > >>>>>>> Kirill > > > >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> > > > >>>>>> -- > > > >>>>>> Ryan Blue > > > >>>>>> Software Engineer > > > >>>>>> Netflix > > > >>>>> > > > >>>>> > > > >>>> > > > >>>> > > > >>>> -- > > > >>>> Ryan Blue > > > >>>> Software Engineer > > > >>>> Netflix > > > >>>> > > > >>>> > > > >>> > > > >>> > > > >>> -- > > > >>> Ryan Blue > > > >>> Software Engineer > > > >>> Netflix > > > >> > > > >> > > > > > > > > > > > > -- > > Ryan Blue > > Software Engineer > > Netflix > > > > > > -- > kir > -- Ryan Blue Software Engineer Netflix
