I'm not very qualified to answer comparison questions like that. You may want to check with the Apache Drill community for their performance numbers.
On Tue, Apr 12, 2016 at 10:46 AM, Stefán Baxter <ste...@activitystream.com> wrote:

> Great, thank you.
>
> Has a comparison been made between Drill and Presto with regard to
> Parquet and low-latency queries?
>
> Regards,
> -Stefan
>
> On Tue, Apr 12, 2016 at 5:41 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> > The 1.9 release is about ready, but blocked on a fix to the read path. We
> > recently switched to a ByteBuffer-based read path and we identified some
> > problems that require further testing. You can follow PARQUET-400 for it.
> > I'm not sure what the timeline for this fix is.
> >
> > As far as a "staging format" goes, I always recommend Avro, since there is
> > good support for the Avro object model on top of both formats. Avro is also
> > a Hadoop format (splittable) if you need to process data before moving it
> > to Parquet for long-term storage.
> >
> > Your multi-column situation sounds somewhat reasonable. You're probably
> > seeking before each column read. You could try grouping columns that will
> > be selected together in the schema. We could also improve how Parquet reads
> > data here, which is currently pretty simple. It will read contiguous
> > chunks, but a gap of even 1 byte will cause a seek operation. We could have
> > a threshold that groups read operations together if they aren't separated
> > by at least that many bytes.
> >
> > We currently use Presto for low-latency queries, Pig for ETL, Spark for ETL
> > and other tasks, and Hive for ETL and batch SQL. We're interested in
> > getting the dictionary row group filter support into all of those since
> > we've seen great results with it.
> >
> > rb
> >
> > On Tue, Apr 12, 2016 at 10:29 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> >
> > > Thanks Ryan,
> > >
> > > Very exciting news! When is 1.9 expected to become available? (roughly)
> > >
> > > I sure hope the Avro stuff makes it into parquet-avro :)
> > >
> > > Is there a preferred "staging" format other than Avro that you would
> > > recommend?
> > >
> > > Regarding the multi-column overhead:
> > >
> > >    - I have a single table (no joins), stored as Parquet in multiple
> > >    segment files (8 segment files)
> > >    - In my test I have ~2 million entries in a fairly wide table (150
> > >    columns)
> > >    - Selecting a single column + count(*) with a simple where clause (on
> > >    two fields, neither returned) takes ~1 second to run
> > >    - When I run the same query with 8 columns selected and grouped, the
> > >    query takes ~8 seconds
> > >    - The only overhead involved should be fetching values from the other
> > >    columns and grouping + aggregating
> > >
> > > What environment/tools are you using when querying Parquet?
> > >
> > > Regards,
> > > -Stefán
> > >
> > > On Tue, Apr 12, 2016 at 5:11 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> > >
> > > > Stefán,
> > > >
> > > > I can't speak for when features are going to be available in Drill, but
> > > > for Parquet, I can give you a few updates:
> > > >
> > > > * Time-stamp support (bigint+delta_encoding): This is going to be in the
> > > > 1.9.0 release. Also, parquet-avro is waiting on a review for supporting
> > > > timestamp types, which may not make it in.
> > > >
> > > > * Predicate-pushdown for dictionary values: A row group filter that uses
> > > > dictionaries has been committed and will be in 1.9.0 as well. It's
> > > > working great for Pig jobs in our environment already. This requires
> > > > setting a property to enable it: parquet.filter.dictionary.enabled=true.
> > > >
> > > > * Bloom filters: The predicate support for dictionaries removes a lot of
> > > > the need for bloom filters. I haven't done the calculations yet, but
> > > > there's a narrow margin of % unique where bloom filters are valuable and
> > > > the column isn't dictionary-encoded. We've not seen an example in our
> > > > data where we can't solve the problem by making dictionaries larger. You
> > > > can follow this at PARQUET-41 [1].
> > > >
> > > > * Multi-column overhead: Can you give us more information here? What did
> > > > the two tables look like? It could be that you're adding more seeks to
> > > > pull down the data in the wide-table case, but I'm not sure without more
> > > > info.
> > > >
> > > > rb
> > > >
> > > > [1]: https://issues.apache.org/jira/browse/PARQUET-41
> > > >
> > > > On Mon, Apr 11, 2016 at 12:32 PM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > We are using Parquet with Drill and we are quite happy, thank you all
> > > > > very much.
> > > > >
> > > > > We use Drill to query it and I wonder if there are any best practices,
> > > > > recommended setups, or tips you could share.
> > > > >
> > > > > I also wanted to ask about some of the things we think/hope are in
> > > > > scope and what effect they will have on performance.
> > > > >
> > > > > *Time-stamp support (bigint+delta_encoding)*
> > > > > We are using Avro for inbound/fresh data, and I believe 1.8 finally
> > > > > has date/timestamp support, so I wonder when Parquet will support
> > > > > timestamp + millis in a more efficient (encoded) way.
> > > > >
> > > > > *Predicate-pushdown for dictionary values*
> > > > > I hope I'm using the right terms, but I'm basically referring to the
> > > > > ability to skip segments if the value being searched for is not in the
> > > > > dictionary for that segment (when/if dictionary encoding is used). I
> > > > > may be wrong in thinking that this will speed up our queries quite a
> > > > > bit, but I think our data and some of our queries would benefit.
> > > > >
> > > > > *Bloom filters*
> > > > > I followed some discussion here on implementing bloom filters and some
> > > > > initial tests that were done to assess possible benefits. How did that
> > > > > go? (Meaning, will it be done, and are there any initial numbers
> > > > > regarding potential gain?)
> > > > >
> > > > > *Multi-column overhead*
> > > > > We are seeing that queries that fetch values from many columns are a
> > > > > lot slower than the "same" queries when run with only a few columns.
> > > > > This is to be expected, but I wonder if there are any tricks/tips
> > > > > available here. We are, for example, using nested structures that
> > > > > could be flattened, but that seems irrelevant.
> > > > >
> > > > > Best regards,
> > > > > -Stefán
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix

--
Ryan Blue
Software Engineer
Netflix
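For readers who want to try the dictionary row group filter discussed above: the property name parquet.filter.dictionary.enabled comes straight from the thread, but the rest of this sketch (the MapReduce Job setup, the "country" column, and the "IS" literal) is purely illustrative and assumes parquet-mr 1.9.0 or later.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.io.api.Binary;

public class DictionaryFilterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Enable the dictionary-based row group filter mentioned in the thread
    // (available starting with parquet-mr 1.9.0).
    conf.setBoolean("parquet.filter.dictionary.enabled", true);

    Job job = Job.getInstance(conf, "dictionary-filter-example");

    // A simple equality predicate; row groups whose dictionary for the
    // (hypothetical) "country" column does not contain "IS" can be skipped.
    FilterPredicate predicate = FilterApi.eq(
        FilterApi.binaryColumn("country"),
        Binary.fromString("IS"));
    ParquetInputFormat.setFilterPredicate(job.getConfiguration(), predicate);

    // ... configure input/output paths and submit the job as usual.
  }
}
```

The key point is that the predicate only pays off when the filtered column is dictionary-encoded in the row groups being read, which is why Ryan notes it reduces the need for bloom filters.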
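Ryan's suggestion of Avro as a staging format works because parquet-avro lets you write Avro records directly into Parquet for long-term storage. A minimal sketch, assuming the AvroParquetWriter builder API; the "Event" schema, its field names, and the output path are made up for illustration:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class AvroToParquetExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema for an event record staged in Avro.
    Schema schema = SchemaBuilder.record("Event").fields()
        .requiredString("type")
        .requiredLong("occurred_at")  // epoch millis, stored as a plain long
        .endRecord();

    // Write Avro GenericRecords to a Parquet file using the Avro object model.
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("events.parquet"))
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
      GenericRecord event = new GenericData.Record(schema);
      event.put("type", "page_view");
      event.put("occurred_at", System.currentTimeMillis());
      writer.write(event);
    }
  }
}
```

Because the same Avro object model is supported on both the Avro and Parquet sides, the staging-to-storage conversion is mostly a matter of swapping the writer, which is the reason Avro is recommended in the thread.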