Hi Ryan,

We have now been using Parquet (with Drill on a local file system, and in a Hadoop+Spark context) for a while and we love it.
There are a few things that I would like to get your feedback on, if you have the time:

- Rough set indexes for segments, or any other alternatives on the Parquet roadmap (IndexR / Infobright)
- I remember some bloom filter checks that did not, I think, materialize or were not as beneficial as first intended
- Full predicate push-down in Drill, and what performance improvements it might unlock
- Delta encoding of timestamps (bigint) is available now in Parquet, right?

Is there a roadmap available somewhere, or are there any specific improvements in speed/compression that we would love to know about?

I hope we can start contributing later this year, as we would love to also be on that end :).

Thank you again, the Parquet team has made our life a lot easier (S3+Presto is another use-case we love).

Regards,
-Stefán

On Mon, May 30, 2016 at 4:33 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> I think the easiest place is to look for new artifacts in Maven Central:
>
> http://search.maven.org/#search|ga|1|org.apache.parquet
>
> I'm not sure why the download page has 2.1, we should fix that.
>
> rb
>
> On Sun, May 29, 2016 at 7:28 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> > Hi Ryan,
> >
> > Where should I be monitoring these releases?
> >
> > I see that https://parquet.apache.org/downloads/ references version 2.1 and that confuses me a bit :).
> >
> > Regards,
> > -Stefán
> >
> > On Thu, Apr 14, 2016 at 3:57 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> > > Stefán,
> > >
> > > It's common for your Hadoop distribution vendor to backport the features you need in their next release, so checking with them is a good start.
> > >
> > > Right now, we're running these tests on 1.8.1 with dictionary filtering backported. I think the goal for the Parquet community is to get the 1.9.0 release out without maintaining a 1.8.x line, but we could put out a 1.8.2 if there's interest and the 1.9.0 branch is blocked for a long time. This is a good thing to discuss at the next Parquet sync-up.
> > >
> > > rb
> > >
> > > On Thu, Apr 14, 2016 at 8:45 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > > Hi Ryan,
> > > >
> > > > What tools do you use when running queries on 1.9 while it's in development? Is that Pig only?
> > > >
> > > > I ask because I'm quite curious about what performance improvements we might gain with the new dictionary filters.
> > > >
> > > > Regards,
> > > > -Stefán
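A minimal sketch of how the dictionary filters discussed here might be enabled with parquet-mr's filter2 API. The column name "country" and the value "IS" are made-up placeholders; the opt-in property is the one Ryan gives later in the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.parquet.filter2.predicate.FilterApi;
    import org.apache.parquet.filter2.predicate.FilterPredicate;
    import org.apache.parquet.hadoop.ParquetInputFormat;
    import org.apache.parquet.io.api.Binary;

    public class DictionaryFilterSetup {
      public static Configuration configure() {
        Configuration conf = new Configuration();
        // Push the predicate down to the Parquet readers.
        FilterPredicate pred =
            FilterApi.eq(FilterApi.binaryColumn("country"), Binary.fromString("IS"));
        ParquetInputFormat.setFilterPredicate(conf, pred);
        // Opt in to the dictionary-based row group filter: row groups whose
        // dictionaries cannot contain the value are skipped without reading pages.
        conf.setBoolean("parquet.filter.dictionary.enabled", true);
        return conf;
      }
    }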
> > > > On Tue, Apr 12, 2016 at 6:57 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> > > > > I'm not very qualified to answer comparison questions like that. You may want to check with the Apache Drill community for their performance numbers.
> > > > >
> > > > > On Tue, Apr 12, 2016 at 10:46 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > > > > Great, thank you.
> > > > > >
> > > > > > Has a comparison been made between Drill and Presto in regards to Parquet and low-latency queries?
> > > > > >
> > > > > > Regards,
> > > > > > -Stefan
> > > > > >
> > > > > > On Tue, Apr 12, 2016 at 5:41 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> > > > > > > The 1.9 release is about ready, but blocked on a fix to the read path. We recently switched to a ByteBuffer-based read path and we identified some problems that require further testing. You can follow PARQUET-400 for it. I'm not sure what the timeline for this fix is.
> > > > > > >
> > > > > > > As far as "staging format", I always recommend Avro, since there is good support for the Avro object model on top of both formats. Avro is also a Hadoop format (splittable) if you need to process data before moving it to Parquet for long-term storage.
> > > > > > >
> > > > > > > Your multi-column situation sounds somewhat reasonable. You're probably seeking before each column read. You could try grouping columns that will be selected together in the schema. We could also improve how Parquet reads data here, which is currently pretty simple. It will read contiguous chunks, but a gap of even 1 byte will cause a seek operation. We could have a threshold that groups read operations together if they aren't separated by at least that many bytes.
> > > > > > >
> > > > > > > We currently use Presto for low-latency queries, Pig for ETL, Spark for ETL and other tasks, and Hive for ETL and batch SQL. We're interested in getting the dictionary row group filter support in all of those, since we've seen great results with it.
> > > > > > >
> > > > > > > rb
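The read-grouping threshold Ryan describes is easy to picture with a small illustrative sketch (not Parquet code): given the byte ranges of the column chunks a query needs, merge any two reads separated by no more than a threshold, trading a little wasted I/O for fewer seeks.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class ReadCoalescer {
      /** Each range is {offset, length}. Returns the merged read plan. */
      static List<long[]> coalesce(List<long[]> ranges, long maxGap) {
        List<long[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));
        List<long[]> plan = new ArrayList<>();
        for (long[] r : sorted) {
          if (!plan.isEmpty()) {
            long[] last = plan.get(plan.size() - 1);
            long gap = r[0] - (last[0] + last[1]);
            if (gap <= maxGap) {
              // Small gap: extend the previous read instead of seeking again.
              last[1] = Math.max(last[1], r[0] + r[1] - last[0]);
              continue;
            }
          }
          plan.add(new long[] { r[0], r[1] });  // large gap: a new read, one seek
        }
        return plan;
      }
    }

With maxGap = 0 this reproduces the current behavior Ryan describes: contiguous chunks are read together, but a gap of even 1 byte forces a seek.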
> > > > > > > On Tue, Apr 12, 2016 at 10:29 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > > > > > > Thanks Ryan,
> > > > > > > >
> > > > > > > > Very exciting news! When is 1.9 expected to become available? (roughly)
> > > > > > > >
> > > > > > > > I sure hope the Avro stuff makes it into parquet-avro :)
> > > > > > > >
> > > > > > > > Is there a preferred "staging" format other than Avro that you would recommend?
> > > > > > > >
> > > > > > > > Regarding the multi-column overhead:
> > > > > > > >
> > > > > > > >    - I have a single table (no joins) in one Parquet dataset with multiple segment files (8 segment files)
> > > > > > > >    - In my test I have ~2 million entries in a fairly wide table (150 columns)
> > > > > > > >    - Selecting a single column + count(*) with a simple where clause (on two fields, neither returned) takes ~1 second to run
> > > > > > > >    - When I run the same query with 8 columns selected and grouped, the query takes ~8 seconds
> > > > > > > >    - The only overhead involved should be fetching values from the other columns and grouping+aggregating
> > > > > > > >
> > > > > > > > What environment/tools are you using when querying Parquet?
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > -Stefán
> > > > > > > >
> > > > > > > > On Tue, Apr 12, 2016 at 5:11 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> > > > > > > > > Stefán,
> > > > > > > > >
> > > > > > > > > I can't speak for when features are going to be available to Drill, but for Parquet, I can give you a few updates:
> > > > > > > > >
> > > > > > > > > * Time-stamp support (bigint+delta_encoding): This is going to be in the 1.9.0 release. Also, parquet-avro is waiting on a review for supporting timestamp types, which may not make it in.
> > > > > > > > >
> > > > > > > > > * Predicate push-down for dictionary values: A row group filter that uses dictionaries has been committed and will be in 1.9.0 as well. It's already working great for Pig jobs in our environment. This requires setting a property to enable it: parquet.filter.dictionary.enabled=true.
> > > > > > > > >
> > > > > > > > > * Bloom filters: The predicate support for dictionaries removes a lot of the need for bloom filters. I haven't done the calculations yet, but there's a narrow margin of % unique where bloom filters are valuable and the column isn't dictionary-encoded. We've not seen an example in our data where we can't solve the problem by making dictionaries larger. You can follow this at PARQUET-41 [1].
> > > > > > > > >
> > > > > > > > > * Multi-column overhead: Can you give us more information here? What did the two tables look like? It could be that you're adding more seeks to pull down the data in the wide-table case, but I'm not sure without more info.
> > > > > > > > >
> > > > > > > > > rb
> > > > > > > > >
> > > > > > > > > [1]: https://issues.apache.org/jira/browse/PARQUET-41
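As far as I understand, the bigint+delta_encoding support Ryan lists rides on parquet-mr's version 2 writer, which (after dictionary fallback) writes int64 columns such as epoch-millis timestamps with DELTA_BINARY_PACKED instead of plain encoding. A rough sketch of opting in, assuming the parquet.writer.version property exposed as ParquetOutputFormat.WRITER_VERSION:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.parquet.hadoop.ParquetOutputFormat;

    public class WriterVersionSetup {
      public static Configuration configure() {
        Configuration conf = new Configuration();
        // Select the v2 writer so int64 timestamp columns get delta encoding.
        // ParquetOutputFormat.WRITER_VERSION is the key "parquet.writer.version".
        conf.set(ParquetOutputFormat.WRITER_VERSION, "v2");
        return conf;
      }
    }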
> > > > > > > > > On Mon, Apr 11, 2016 at 12:32 PM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > We are using Parquet with Drill and we are quite happy, thank you all very much.
> > > > > > > > > >
> > > > > > > > > > We use Drill to query it and I wonder if there are some sort of best practices, a recommended setup or any tips you could share.
> > > > > > > > > >
> > > > > > > > > > I also wanted to ask about some of the things we think/hope are in scope and what effect they will have on performance.
> > > > > > > > > >
> > > > > > > > > > *Time-stamp support (bigint+delta_encoding)*
> > > > > > > > > > We are using Avro for inbound/fresh data and I believe 1.8 finally has date/timestamp support. I wonder when Parquet will support timestamp+millis in a more efficient (encoded) way.
> > > > > > > > > >
> > > > > > > > > > *Predicate push-down for dictionary values*
> > > > > > > > > > I hope I'm using the right terms, but I'm basically referring to the ability to skip segments if the value being searched for is not in the dictionary for that segment (when/if dictionary encoding is used). I may be wrong in thinking that this will speed up our queries quite a bit, but I think our data and some of our queries would benefit.
> > > > > > > > > >
> > > > > > > > > > *Bloom filters*
> > > > > > > > > > I monitored some discussion here on implementing bloom filters and some initial tests that were done to assess possible benefits. How did that go? (Meaning: will it be done, and are there any initial numbers regarding the potential gain?)
> > > > > > > > > >
> > > > > > > > > > *Multi-column overhead*
> > > > > > > > > > We are seeing that queries that fetch values from many columns are a lot slower than the "same" queries run with only a few columns. This is to be expected, but I wonder if there are any tricks/tips available here. We are, for example, using nested structures that could be flattened, but that seems irrelevant.
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > -Stefán
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Ryan Blue
> > > > > > > > > Software Engineer
> > > > > > > > > Netflix
> > > > > > >
> > > > > > > --
> > > > > > > Ryan Blue
> > > > > > > Software Engineer
> > > > > > > Netflix
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix
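The thread keeps coming back to Avro as the staging format before long-term Parquet storage, so here is a minimal sketch of that hop using parquet-avro's AvroParquetWriter. The file names are placeholders and error handling is elided; the Avro file's own schema is reused for the Parquet copy:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class AvroToParquet {
      public static void main(String[] args) throws Exception {
        // Open the Avro staging file and pick up its schema.
        DataFileReader<GenericRecord> reader = new DataFileReader<>(
            new File("staging.avro"), new GenericDatumReader<GenericRecord>());
        Schema schema = reader.getSchema();
        try (ParquetWriter<GenericRecord> writer =
                 new AvroParquetWriter<GenericRecord>(new Path("table.parquet"), schema)) {
          for (GenericRecord record : reader) {
            writer.write(record);  // one pass: Avro in, Parquet out
          }
        }
        reader.close();
      }
    }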