Hi Ryan,

We have now been using Parquet (with Drill on a local file system, and in a Hadoop+Spark context) for a while and we love it.
There are a few things that I would like to get your feedback on, if you have the time:

- Rough set indexes for segments, or any other alternatives on the Parquet roadmap (IndexR / Infobright)
- I remember some bloom filter checks that did not, I think, materialize or were not as beneficial as first intended
- Full predicate push-down in Drill, and what performance improvements it might unlock
- Delta encoding of timestamps (bigint) is available now in Parquet, right?

Is there a roadmap available somewhere, or are there any specific improvements in speed/compression that we would love to know about?

I hope we can start contributing later this year, as we would love to also be on that end :).

Thank you again, the Parquet team has made our life a lot easier (S3+Presto is another use-case we love).

Regards,
-Stefán

On Mon, May 30, 2016 at 4:33 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> I think the easiest place is to look for new artifacts in Maven Central:
>
> http://search.maven.org/#search|ga|1|org.apache.parquet
>
> I'm not sure why the download page has 2.1, we should fix that.
>
> rb
>
> On Sun, May 29, 2016 at 7:28 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> > Hi Ryan,
> >
> > Where should I be monitoring these releases?
> >
> > I see that https://parquet.apache.org/downloads/ references version 2.1 and that confuses me a bit :).
> >
> > Regards,
> > -Stefán
> >
> > On Thu, Apr 14, 2016 at 3:57 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> > > Stefán,
> > >
> > > It's common for your Hadoop distribution vendor to backport the features you need in their next release, so checking with them is a good start.
> > >
> > > Right now, we're running these tests on 1.8.1 with dictionary filtering backported. I think the goal for the Parquet community is to get the 1.9.0 release out without maintaining a 1.8.x line, but we could put out a 1.8.2 if there's interest and the 1.9.0 branch is blocked for a long time. This is a good thing to discuss at the next Parquet sync-up.
> > >
> > > rb
> > >
> > > On Thu, Apr 14, 2016 at 8:45 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > > Hi Ryan,
> > > >
> > > > What tools do you use when running queries on 1.9 while it's in development? Is that Pig only?
> > > >
> > > > I ask because I'm quite curious about what performance improvements we might gain with the new dictionary filters.
> > > >
> > > > Regards,
> > > > -Stefán
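A minimal sketch of how the dictionary filters discussed here might be enabled with parquet-mr's filter2 API. The column name "country" and the value "IS" are made-up placeholders; the opt-in property is the one Ryan gives later in the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.parquet.filter2.predicate.FilterApi;
    import org.apache.parquet.filter2.predicate.FilterPredicate;
    import org.apache.parquet.hadoop.ParquetInputFormat;
    import org.apache.parquet.io.api.Binary;

    public class DictionaryFilterSetup {
      public static Configuration configure() {
        Configuration conf = new Configuration();
        // Push the predicate down to the Parquet readers.
        FilterPredicate pred =
            FilterApi.eq(FilterApi.binaryColumn("country"), Binary.fromString("IS"));
        ParquetInputFormat.setFilterPredicate(conf, pred);
        // Opt in to the dictionary-based row group filter: row groups whose
        // dictionaries cannot contain the value are skipped without reading pages.
        conf.setBoolean("parquet.filter.dictionary.enabled", true);
        return conf;
      }
    }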
> > > > On Tue, Apr 12, 2016 at 6:57 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> > > > > I'm not very qualified to answer comparison questions like that. You may want to check with the Apache Drill community for their performance numbers.
> > > > >
> > > > > On Tue, Apr 12, 2016 at 10:46 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > > > > Great, thank you.
> > > > > >
> > > > > > Has a comparison been made between Drill and Presto in regards to Parquet and low-latency queries?
> > > > > >
> > > > > > Regards,
> > > > > > -Stefan
> > > > > >
> > > > > > On Tue, Apr 12, 2016 at 5:41 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> > > > > > > The 1.9 release is about ready, but blocked on a fix to the read path. We recently switched to a ByteBuffer-based read path and we identified some problems that require further testing. You can follow PARQUET-400 for it. I'm not sure what the timeline for this fix is.
> > > > > > >
> > > > > > > As far as "staging format", I always recommend Avro, since there is good support for the Avro object model on top of both formats. Avro is also a Hadoop format (splittable) if you need to process data before moving it to Parquet for long-term storage.
> > > > > > >
> > > > > > > Your multi-column situation sounds somewhat reasonable. You're probably seeking before each column read. You could try grouping columns that will be selected together in the schema. We could also improve how Parquet reads data here, which is currently pretty simple. It will read contiguous chunks, but a gap of even 1 byte will cause a seek operation. We could have a threshold that groups read operations together if they aren't separated by at least that many bytes.
> > > > > > >
> > > > > > > We currently use Presto for low-latency queries, Pig for ETL, Spark for ETL and other tasks, and Hive for ETL and batch SQL. We're interested in getting the dictionary row group filter support in all of those, since we've seen great results with it.
> > > > > > >
> > > > > > > rb
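The read-grouping threshold Ryan describes is easy to picture with a small illustrative sketch (not Parquet code): given the byte ranges of the column chunks a query needs, merge any two reads separated by no more than a threshold, trading a little wasted I/O for fewer seeks.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class ReadCoalescer {
      /** Each range is {offset, length}. Returns the merged read plan. */
      static List<long[]> coalesce(List<long[]> ranges, long maxGap) {
        List<long[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));
        List<long[]> plan = new ArrayList<>();
        for (long[] r : sorted) {
          if (!plan.isEmpty()) {
            long[] last = plan.get(plan.size() - 1);
            long gap = r[0] - (last[0] + last[1]);
            if (gap <= maxGap) {
              // Small gap: extend the previous read instead of seeking again.
              last[1] = Math.max(last[1], r[0] + r[1] - last[0]);
              continue;
            }
          }
          plan.add(new long[] { r[0], r[1] });  // large gap: a new read, one seek
        }
        return plan;
      }
    }

With maxGap = 0 this reproduces the current behavior Ryan describes: contiguous chunks are read together, but a gap of even 1 byte forces a seek.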
> > > > > > > On Tue, Apr 12, 2016 at 10:29 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > > > > > > Thanks Ryan,
> > > > > > > >
> > > > > > > > Very exciting news! When is 1.9 expected to become available? (roughly)
> > > > > > > >
> > > > > > > > I sure hope the Avro stuff makes it into parquet-avro :)
> > > > > > > >
> > > > > > > > Is there a preferred "staging" format other than Avro that you would recommend?
> > > > > > > >
> > > > > > > > Regarding the multi-column overhead:
> > > > > > > >
> > > > > > > >    - I have a single table (no joins) in one Parquet dataset with multiple segment files (8 segment files)
> > > > > > > >    - In my test I have ~2 million entries in a fairly wide table (150 columns)
> > > > > > > >    - Selecting a single column + count(*) with a simple where clause (on two fields, neither returned) takes ~1 second to run
> > > > > > > >    - When I run the same query with 8 columns selected and grouped, the query takes ~8 seconds
> > > > > > > >    - The only overhead involved should be fetching values from the other columns and grouping+aggregating
> > > > > > > >
> > > > > > > > What environment/tools are you using when querying Parquet?
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > -Stefán
> > > > > > > >
> > > > > > > > On Tue, Apr 12, 2016 at 5:11 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> > > > > > > > > Stefán,
> > > > > > > > >
> > > > > > > > > I can't speak for when features are going to be available to Drill, but for Parquet, I can give you a few updates:
> > > > > > > > >
> > > > > > > > > * Time-stamp support (bigint+delta_encoding): This is going to be in the 1.9.0 release. Also, parquet-avro is waiting on a review for supporting timestamp types, which may not make it in.
> > > > > > > > >
> > > > > > > > > * Predicate push-down for dictionary values: A row group filter that uses dictionaries has been committed and will be in 1.9.0 as well. It's already working great for Pig jobs in our environment. This requires setting a property to enable it: parquet.filter.dictionary.enabled=true.
> > > > > > > > >
> > > > > > > > > * Bloom filters: The predicate support for dictionaries removes a lot of the need for bloom filters. I haven't done the calculations yet, but there's a narrow margin of % unique where bloom filters are valuable and the column isn't dictionary-encoded. We've not seen an example in our data where we can't solve the problem by making dictionaries larger. You can follow this at PARQUET-41 [1].
> > > > > > > > >
> > > > > > > > > * Multi-column overhead: Can you give us more information here? What did the two tables look like? It could be that you're adding more seeks to pull down the data in the wide-table case, but I'm not sure without more info.
> > > > > > > > >
> > > > > > > > > rb
> > > > > > > > >
> > > > > > > > > [1]: https://issues.apache.org/jira/browse/PARQUET-41
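As far as I understand, the bigint+delta_encoding support Ryan lists rides on parquet-mr's version 2 writer, which (after dictionary fallback) writes int64 columns such as epoch-millis timestamps with DELTA_BINARY_PACKED instead of plain encoding. A rough sketch of opting in, assuming the parquet.writer.version property exposed as ParquetOutputFormat.WRITER_VERSION:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.parquet.hadoop.ParquetOutputFormat;

    public class WriterVersionSetup {
      public static Configuration configure() {
        Configuration conf = new Configuration();
        // Select the v2 writer so int64 timestamp columns get delta encoding.
        // ParquetOutputFormat.WRITER_VERSION is the key "parquet.writer.version".
        conf.set(ParquetOutputFormat.WRITER_VERSION, "v2");
        return conf;
      }
    }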
> > > > > > > > > On Mon, Apr 11, 2016 at 12:32 PM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > We are using Parquet with Drill and we are quite happy, thank you all very much.
> > > > > > > > > >
> > > > > > > > > > We use Drill to query it and I wonder if there are some sort of best practices, a recommended setup or any tips you could share.
> > > > > > > > > >
> > > > > > > > > > I also wanted to ask about some of the things we think/hope are in scope and what effect they will have on performance.
> > > > > > > > > >
> > > > > > > > > > *Time-stamp support (bigint+delta_encoding)*
> > > > > > > > > > We are using Avro for inbound/fresh data and I believe 1.8 finally has date/timestamp support. I wonder when Parquet will support timestamp+millis in a more efficient (encoded) way.
> > > > > > > > > >
> > > > > > > > > > *Predicate push-down for dictionary values*
> > > > > > > > > > I hope I'm using the right terms, but I'm basically referring to the ability to skip segments if the value being searched for is not in the dictionary for that segment (when/if dictionary encoding is used). I may be wrong in thinking that this will speed up our queries quite a bit, but I think our data and some of our queries would benefit.
> > > > > > > > > >
> > > > > > > > > > *Bloom filters*
> > > > > > > > > > I monitored some discussion here on implementing bloom filters and some initial tests that were done to assess possible benefits. How did that go? (Meaning: will it be done, and are there any initial numbers regarding the potential gain?)
> > > > > > > > > >
> > > > > > > > > > *Multi-column overhead*
> > > > > > > > > > We are seeing that queries that fetch values from many columns are a lot slower than the "same" queries run with only a few columns. This is to be expected, but I wonder if there are any tricks/tips available here. We are, for example, using nested structures that could be flattened, but that seems irrelevant.
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > -Stefán
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Ryan Blue
> > > > > > > > > Software Engineer
> > > > > > > > > Netflix
> > > > > > >
> > > > > > > --
> > > > > > > Ryan Blue
> > > > > > > Software Engineer
> > > > > > > Netflix
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix
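The thread keeps coming back to Avro as the staging format before long-term Parquet storage, so here is a minimal sketch of that hop using parquet-avro's AvroParquetWriter. The file names are placeholders and error handling is elided; the Avro file's own schema is reused for the Parquet copy:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class AvroToParquet {
      public static void main(String[] args) throws Exception {
        // Open the Avro staging file and pick up its schema.
        DataFileReader<GenericRecord> reader = new DataFileReader<>(
            new File("staging.avro"), new GenericDatumReader<GenericRecord>());
        Schema schema = reader.getSchema();
        try (ParquetWriter<GenericRecord> writer =
                 new AvroParquetWriter<GenericRecord>(new Path("table.parquet"), schema)) {
          for (GenericRecord record : reader) {
            writer.write(record);  // one pass: Avro in, Parquet out
          }
        }
        reader.close();
      }
    }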