I'm not very qualified to answer comparison questions like that. You may want to check with the Apache Drill community for their performance numbers.
On Tue, Apr 12, 2016 at 10:46 AM, Stefán Baxter <ste...@activitystream.com> wrote:

> Great, thank you.
>
> Has a comparison been made between Drill and Presto with regard to
> Parquet and low-latency queries?
>
> Regards,
> -Stefan
>
> On Tue, Apr 12, 2016 at 5:41 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> > The 1.9 release is about ready, but blocked on a fix to the read path. We
> > recently switched to a ByteBuffer-based read path and we identified some
> > problems that require further testing. You can follow PARQUET-400 for it.
> > I'm not sure what the timeline for this fix is.
> >
> > As far as a "staging format" goes, I always recommend Avro, since there is
> > good support for the Avro object model on top of both formats. Avro is also
> > a Hadoop format (splittable) if you need to process data before moving it
> > to Parquet for long-term storage.
> >
> > Your multi-column situation sounds somewhat reasonable. You're probably
> > seeking before each column read. You could try grouping columns that will
> > be selected together in the schema. We could also improve how Parquet reads
> > data here, which is currently pretty simple. It will read contiguous
> > chunks, but a gap of even 1 byte will cause a seek operation. We could have
> > a threshold that groups read operations together if they aren't separated
> > by at least that many bytes.
> >
> > We currently use Presto for low-latency queries, Pig for ETL, Spark for ETL
> > and other tasks, and Hive for ETL and batch SQL. We're interested in
> > getting the dictionary row group filter support into all of those since
> > we've seen great results with it.
> >
> > rb
> >
> > On Tue, Apr 12, 2016 at 10:29 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> >
> > > Thanks Ryan,
> > >
> > > Very exciting news! When is 1.9 expected to become available? (roughly)
> > >
> > > I sure hope the Avro stuff makes it into parquet-avro :)
> > >
> > > Is there a preferred "staging" format other than Avro that you would
> > > recommend?
> > >
> > > Regarding the multi-column overhead:
> > >
> > >    - I have a single table (no joins), stored as Parquet in multiple
> > >    segment files (8 segment files)
> > >    - In my test I have ~2 million entries in a fairly wide table (150
> > >    columns)
> > >    - Selecting a single column + count(*) with a simple where clause (on
> > >    two fields, neither returned) takes ~1 second to run
> > >    - When I run the same query with 8 columns selected and grouped, the
> > >    query takes ~8 seconds
> > >    - The only overhead involved should be fetching values from the other
> > >    columns and grouping + aggregating
> > >
> > > What environment/tools are you using when querying Parquet?
> > >
> > > Regards,
> > > -Stefán
> > >
> > > On Tue, Apr 12, 2016 at 5:11 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> > >
> > > > Stefán,
> > > >
> > > > I can't speak for when features are going to be available in Drill, but
> > > > for Parquet, I can give you a few updates:
> > > >
> > > > * Time-stamp support (bigint+delta_encoding): This is going to be in the
> > > > 1.9.0 release. Also, parquet-avro is waiting on a review for supporting
> > > > timestamp types, which may not make it in.
> > > >
> > > > * Predicate-pushdown for dictionary values: A row group filter that uses
> > > > dictionaries has been committed and will be in 1.9.0 as well. It's
> > > > working great for Pig jobs in our environment already. This requires
> > > > setting a property to enable it: parquet.filter.dictionary.enabled=true.
> > > >
> > > > * Bloom filters: The predicate support for dictionaries removes a lot of
> > > > the need for bloom filters. I haven't done the calculations yet, but
> > > > there's a narrow margin of % unique where bloom filters are valuable and
> > > > the column isn't dictionary-encoded. We've not seen an example in our
> > > > data where we can't solve the problem by making dictionaries larger. You
> > > > can follow this at PARQUET-41 [1].
> > > >
> > > > * Multi-column overhead: Can you give us more information here? What did
> > > > the two tables look like? It could be that you're adding more seeks to
> > > > pull down the data in the wide-table case, but I'm not sure without more
> > > > info.
> > > >
> > > > rb
> > > >
> > > > [1]: https://issues.apache.org/jira/browse/PARQUET-41
> > > >
> > > > On Mon, Apr 11, 2016 at 12:32 PM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > We are using Parquet with Drill and we are quite happy, thank you all
> > > > > very much.
> > > > >
> > > > > We use Drill to query it and I wonder if there are any best practices,
> > > > > recommended setups, or tips you could share.
> > > > >
> > > > > I also wanted to ask about some of the things we think/hope are in
> > > > > scope and what effect they will have on performance.
> > > > >
> > > > > *Time-stamp support (bigint+delta_encoding)*
> > > > > We are using Avro for inbound/fresh data, and I believe 1.8 finally
> > > > > has date/timestamp support, so I wonder when Parquet will support
> > > > > timestamp + millis in a more efficient (encoded) way.
> > > > >
> > > > > *Predicate-pushdown for dictionary values*
> > > > > I hope I'm using the right terms, but I'm basically referring to the
> > > > > ability to skip segments if the value being searched for is not in the
> > > > > dictionary for that segment (when/if dictionary encoding is used). I
> > > > > may be wrong in thinking that this will speed up our queries quite a
> > > > > bit, but I think our data and some of our queries would benefit.
> > > > >
> > > > > *Bloom filters*
> > > > > I followed some discussion here on implementing bloom filters and some
> > > > > initial tests that were done to assess possible benefits. How did that
> > > > > go? (Meaning, will it be done, and are there any initial numbers
> > > > > regarding potential gain?)
> > > > >
> > > > > *Multi-column overhead*
> > > > > We are seeing that queries that fetch values from many columns are a
> > > > > lot slower than the "same" queries when run with only a few columns.
> > > > > This is to be expected, but I wonder if there are any tricks/tips
> > > > > available here. We are, for example, using nested structures that
> > > > > could be flattened, but that seems irrelevant.
> > > > >
> > > > > Best regards,
> > > > > -Stefán
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix

--
Ryan Blue
Software Engineer
Netflix
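For readers who want to try the dictionary row group filter discussed above: the property name parquet.filter.dictionary.enabled comes straight from the thread, but the rest of this sketch (the MapReduce Job setup, the "country" column, and the "IS" literal) is purely illustrative and assumes parquet-mr 1.9.0 or later.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.io.api.Binary;

public class DictionaryFilterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Enable the dictionary-based row group filter mentioned in the thread
    // (available starting with parquet-mr 1.9.0).
    conf.setBoolean("parquet.filter.dictionary.enabled", true);

    Job job = Job.getInstance(conf, "dictionary-filter-example");

    // A simple equality predicate; row groups whose dictionary for the
    // (hypothetical) "country" column does not contain "IS" can be skipped.
    FilterPredicate predicate = FilterApi.eq(
        FilterApi.binaryColumn("country"),
        Binary.fromString("IS"));
    ParquetInputFormat.setFilterPredicate(job.getConfiguration(), predicate);

    // ... configure input/output paths and submit the job as usual.
  }
}
```

The key point is that the predicate only pays off when the filtered column is dictionary-encoded in the row groups being read, which is why Ryan notes it reduces the need for bloom filters.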
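Ryan's suggestion of Avro as a staging format works because parquet-avro lets you write Avro records directly into Parquet for long-term storage. A minimal sketch, assuming the AvroParquetWriter builder API; the "Event" schema, its field names, and the output path are made up for illustration:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class AvroToParquetExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema for an event record staged in Avro.
    Schema schema = SchemaBuilder.record("Event").fields()
        .requiredString("type")
        .requiredLong("occurred_at")  // epoch millis, stored as a plain long
        .endRecord();

    // Write Avro GenericRecords to a Parquet file using the Avro object model.
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("events.parquet"))
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
      GenericRecord event = new GenericData.Record(schema);
      event.put("type", "page_view");
      event.put("occurred_at", System.currentTimeMillis());
      writer.write(event);
    }
  }
}
```

Because the same Avro object model is supported on both the Avro and Parquet sides, the staging-to-storage conversion is mostly a matter of swapping the writer, which is the reason Avro is recommended in the thread.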