I'm not very qualified to answer comparison questions like that. You may
want to check with the Apache Drill community for their performance numbers.

On Tue, Apr 12, 2016 at 10:46 AM, Stefán Baxter <ste...@activitystream.com>
wrote:

> Great, thank you.
>
> Has a comparison been made between Drill and Presto with regard to
> Parquet and low-latency queries?
>
> Regards,
>  -Stefan
>
> On Tue, Apr 12, 2016 at 5:41 PM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
> > The 1.9 release is about ready, but blocked on a fix to the read path. We
> > recently switched to a ByteBuffer-based read path and we identified some
> > problems that require further testing. You can follow PARQUET-400 for it.
> > I'm not sure what the timeline for this fix is.
> >
> > As far as "staging format", I always recommend Avro since there is good
> > support for the Avro object model on top of both formats. Avro's also a
> > Hadoop format (splittable) if you need to process data before moving it
> > to Parquet for long-term storage.
> >
> > Your multi-column situation sounds somewhat reasonable. You're probably
> > seeking before each column read. You could try grouping columns that will
> > be selected together in the schema. We could also improve how Parquet
> > reads data here, which is currently pretty simple. It will read
> > contiguous chunks, but a gap of even 1 byte will cause a seek operation.
> > We could have a threshold that groups read operations together if they
> > aren't separated by at least that many bytes.
> >
> > We currently use Presto for low-latency queries, Pig for ETL, Spark for
> > ETL and other tasks, and Hive for ETL and batch SQL. We're interested in
> > getting the dictionary row group filter support in all of those since
> > we've seen great results with it.
> >
> > rb
> >
> > On Tue, Apr 12, 2016 at 10:29 AM, Stefán Baxter <ste...@activitystream.com>
> > wrote:
> >
> > > Thanks Ryan,
> > >
> > > Very exciting news! When is 1.9 expected to become available? (roughly)
> > >
> > > I sure hope the Avro stuff makes it into parquet-avro :)
> > >
> > > Is there a preferred "staging" format other than Avro that you would
> > > recommend?
> > >
> > > Regarding the multi-column overhead:
> > >
> > >    - I have a single table (no joins) stored in Parquet with multiple
> > >    segment files (8 segment files)
> > >    - In my test I have ~2 million entries in a fairly wide table (150
> > >    columns)
> > >    - Selecting a single column + count(*) with a simple where clause (on
> > >    two fields, neither returned) takes ~1 second to run
> > >    - When I run the same query with 8 columns selected and grouped, the
> > >    query takes ~8 seconds (a sketch of how we measure this is below)
> > >    - The only overhead involved should be fetching values from the other
> > >    columns and grouping+aggregating
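> > >
> > > For reference, the timing is measured roughly like this through Drill's
> > > JDBC driver (a sketch; the connection URL, table path, and column names
> > > are placeholders, not our real schema):
> > >
> > >     import java.sql.Connection;
> > >     import java.sql.DriverManager;
> > >     import java.sql.ResultSet;
> > >     import java.sql.Statement;
> > >
> > >     try (Connection conn =
> > >             DriverManager.getConnection("jdbc:drill:drillbit=localhost");
> > >          Statement stmt = conn.createStatement()) {
> > >       long start = System.nanoTime();
> > >       // Single-column variant (~1s); selecting 7 more columns takes ~8s.
> > >       try (ResultSet rs = stmt.executeQuery(
> > >           "SELECT col_a, count(*) FROM dfs.`/data/table` " +
> > >           "WHERE f1 = 'x' AND f2 = 'y' GROUP BY col_a")) {
> > >         while (rs.next()) { }  // drain results so the query completes
> > >       }
> > >       System.out.printf("%.1fs%n", (System.nanoTime() - start) / 1e9);
> > >     }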
> > >
> > > What environment/tools are you using when querying Parquet?
> > >
> > > Regards,
> > >  -Stefán
> > >
> > >
> > >
> > > On Tue, Apr 12, 2016 at 5:11 PM, Ryan Blue <rb...@netflix.com.invalid>
> > > wrote:
> > >
> > > > Stefán,
> > > >
> > > > I can't speak for when features are going to be available to Drill,
> > > > but for Parquet, I can give you a few updates:
> > > >
> > > > * Time-stamp support (bigint+delta_encoding): This is going to be in
> > > > the 1.9.0 release. Also, parquet-avro is waiting on a review for
> > > > supporting timestamp types, which may not make it in.
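> > > >
> > > > For context, on the Avro side that uses the 1.8 logical types (a
> > > > minimal sketch; the variable name is illustrative):
> > > >
> > > >     import org.apache.avro.LogicalTypes;
> > > >     import org.apache.avro.Schema;
> > > >
> > > >     // A long annotated as timestamp-millis; carrying this through to
> > > >     // Parquet's encoded timestamp type is what the review is for.
> > > >     Schema timestampMillis =
> > > >         LogicalTypes.timestampMillis()
> > > >             .addToSchema(Schema.create(Schema.Type.LONG));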
> > > >
> > > > * Predicate-Pushdown for dictionary values: A row group filter that
> > > > uses dictionaries has been committed and will be in 1.9.0 as well. It's
> > > > working great for Pig jobs in our environment already. This requires
> > > > setting a property to enable it: parquet.filter.dictionary.enabled=true.
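> > > >
> > > > For example, in a Hadoop-based job that is just (the property name is
> > > > the one above; the rest is standard Configuration usage):
> > > >
> > > >     import org.apache.hadoop.conf.Configuration;
> > > >
> > > >     Configuration conf = new Configuration();
> > > >     // Opt in to the dictionary-based row group filter (1.9.0+).
> > > >     conf.setBoolean("parquet.filter.dictionary.enabled", true);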
> > > >
> > > > * Bloom filters: The predicate support for dictionaries removes a lot
> > > > of the need for bloom filters. I haven't done the calculations yet, but
> > > > there's only a narrow band of value uniqueness where bloom filters are
> > > > valuable and the column isn't dictionary-encoded. We've not seen an
> > > > example in our data where we can't solve the problem by making
> > > > dictionaries larger. You can follow this at PARQUET-41 [1].
> > > >
> > > > * Multi-column overhead: Can you give us more information here? What
> > > > did the two tables look like? It could be that you're adding more seeks
> > > > to pull down the data in the wide table case, but I'm not sure without
> > > > more info.
> > > >
> > > > rb
> > > >
> > > >
> > > > [1]: https://issues.apache.org/jira/browse/PARQUET-41
> > > >
> > > > On Mon, Apr 11, 2016 at 12:32 PM, Stefán Baxter <ste...@activitystream.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > We are using Parquet with Drill and we are quite happy, thank you
> > > > > all very much.
> > > > >
> > > > > We use Drill to query it, and I wonder if there are any best
> > > > > practices, recommended setups, or tips you could share.
> > > > >
> > > > > I also wanted to ask about some of the things we think/hope are in
> > > > > scope and what effect they will have on performance.
> > > > >
> > > > > *Time-stamp support (bigint+delta_encoding) *
> > > > > We are using Avro for inbound/fresh data, and I believe Avro 1.8
> > > > > finally has date/timestamp support. I wonder when Parquet will
> > > > > support timestamp+millis in a more efficient (encoded) way.
> > > > >
> > > > > *Predicate-Pushdown for dictionary values*
> > > > > I hope I'm using the right terms, but I'm basically referring to the
> > > > > ability to skip segments if the value being searched for is not in
> > > > > the dictionary for that segment (when/if dictionary encoding is
> > > > > used). I may be wrong in thinking that this will speed up our queries
> > > > > quite a bit, but I think our data and some of our queries would
> > > > > benefit.
> > > > >
> > > > > *Bloom Filters*
> > > > > I monitored some discussion here on implementing bloom filters and
> > > > > some initial tests that were done to assess possible benefits. How
> > > > > did that go? (Meaning: will it be done, and are there any initial
> > > > > numbers regarding the potential gain?)
> > > > >
> > > > > *Multi column overhead*
> > > > > We are seeing that queries that fetch values from many columns are a
> > > > > lot slower than the "same" queries run with only a few columns. This
> > > > > is to be expected, but I wonder if there are any tricks/tips
> > > > > available here. We are, for example, using nested structures that
> > > > > could be flattened, but that seems irrelevant.
> > > > >
> > > > > Best regards,
> > > > >  -Stefán
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > >
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>



-- 
Ryan Blue
Software Engineer
Netflix
