Re: Parquet pushdown filtering

Adam Gilmore Thu, 10 Dec 2015 18:16:11 -0800

Could we say Monday or Tuesday next week?  I'm actually ahead of you guys
by about 18 hours, so Monday morning my time would be Sunday
afternoon/evening for you.  If that doesn't work, what about Tuesday
morning my time - Monday afternoon/evening your time?


On Fri, Dec 11, 2015 at 1:30 AM, Jason Altekruse <[email protected]>
wrote:

> I can also join this meeting; Julien and I are both on SF time. Looks
> like you are about 5-6 hours behind us, so depending on whether you would
> prefer morning or afternoon, we'll just be a little further into our days.
>
> On Wed, Dec 9, 2015 at 7:16 PM, Adam Gilmore <[email protected]>
> wrote:
>
> > Sure - I'm in Australia so I'm not sure how the timezones will work for
> > you guys, but I'm pretty flexible.  Where are you located?
> >
> > On Wed, Dec 9, 2015 at 5:48 AM, Julien Le Dem <[email protected]> wrote:
> >
> > > Adam: do you want to schedule a hangout?
> > >
> > > On Tue, Dec 8, 2015 at 4:59 AM, Adam Gilmore <[email protected]>
> > > wrote:
> > >
> > > > > That makes sense, yep.  The problem is I guess with my
> > > > > implementation.  I will iterate through all Parquet files and try to
> > > > > eliminate ones where the filter conflicts with the statistics.  In
> > > > > instances where no files match the filter, I end up with an empty set
> > > > > of files for the Parquet scan to iterate through.  I suppose I could
> > > > > just pick the schema of the first file or something, but that seems
> > > > > like a pretty messy rule.
> > > > >
> > > > > Julien - I'd be happy to have a chat about this.  I've pretty much got
> > > > > the implementation down, but need to solve a few of these little
> > > > > issues.
> > > >
> > > >
> > > > On Fri, Dec 4, 2015 at 5:22 AM, Hanifi GUNES <[email protected]>
> > > > wrote:
> > > >
> > > > > > Regarding your point #1: I guess Daniel struggled with this
> > > > > > limitation as well. I merged a few of his patches, which addressed
> > > > > > empty batch (no data) handling in various places during execution.
> > > > > > That said, we still have not had time to develop a solid way to
> > > > > > handle empty batches with no schema.
> > > > > >
> > > > > > *- Scan batches don't allow empty batches.  This means if a
> > > > > > particular filter filters out *all* rows, we get an exception.*
> > > > > > It looks to me like you are referring to no data rather than no
> > > > > > schema here. I would expect graceful execution in this case. Do you
> > > > > > mind sharing a simple reproduction?
> > > > >
> > > > >
> > > > > -Hanifi
> > > > >
> > > > > 2015-12-03 10:56 GMT-08:00 Julien Le Dem <[email protected]>:
> > > > >
> > > > > > Hey Adam,
> > > > > > > If you have questions about the Parquet side of things, I'm happy
> > > > > > > to chat.
> > > > > > Julien
> > > > > >
> > > > > > > On Tue, Dec 1, 2015 at 10:20 PM, Parth Chandra <[email protected]>
> > > > > > > wrote:
> > > > > >
> > > > > > > > Parquet metadata has the rowCount for every rowGroup which is
> > > > > > > > also the value count for every column in the rowGroup. Isn't
> > > > > > > > that what you need?
> > > > > > >
> > > > > > > > On Tue, Dec 1, 2015 at 10:10 PM, Adam Gilmore <[email protected]>
> > > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi guys,
> > > > > > > >
> > > > > > > > > I'm trying to (re)implement pushdown filtering for Parquet with
> > > > > > > > > the new Parquet metadata caching implementation.
> > > > > > > >
> > > > > > > > I've run into a couple of challenges:
> > > > > > > >
> > > > > > > > >    1. Scan batches don't allow empty batches.  This means if a
> > > > > > > > >    particular filter filters out *all* rows, we get an exception.
> > > > > > > > >    I haven't read the full comments on the relevant JIRA items,
> > > > > > > > >    but it seems odd that we can't query an empty JSON file, for
> > > > > > > > >    example.  This is a bit of a blocker to implement the pushdown
> > > > > > > > >    filtering properly.
> > > > > > > > >    2. The Parquet metadata doesn't include all the relevant
> > > > > > > > >    metadata.  Specifically, count of values is not included,
> > > > > > > > >    therefore the default Parquet statistics filter has issues
> > > > > > > > >    because it compares the count of values with count of nulls to
> > > > > > > > >    work out if it can drop it.  This isn't necessarily a blocker,
> > > > > > > > >    but it feels ugly simulating there's "1" row in a block (just
> > > > > > > > >    to get around the null comparison).
> > > > > > > >
> > > > > > > > > Also, it feels a bit ugly rehydrating the standard Parquet
> > > > > > > > > metadata objects manually.  I'm not sure I understand why we
> > > > > > > > > created our own objects for the Parquet metadata as opposed to
> > > > > > > > > simply writing a custom serializer for those objects which we
> > > > > > > > > store.
> > > > > > > >
> > > > > > > > > Thoughts would be great - I'd love to get a patch out for this.
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Julien
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Julien
> > >
> >
>
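
For readers following the thread, the statistics-based pruning discussed above can be sketched roughly as follows. This is only an illustration, not Drill's or Parquet's actual code: the names ColumnStats, RowGroup, can_drop_eq and prune are hypothetical, the filter is assumed to be a simple "column = value" equality on integers, and the row count is used as the per-column value count, as Parth suggests.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ColumnStats:
    # Hypothetical stand-in for per-column statistics of one row group.
    min_value: Optional[int]
    max_value: Optional[int]
    null_count: int


@dataclass
class RowGroup:
    # Hypothetical stand-in for cached row group metadata.
    path: str
    row_count: int
    stats: Dict[str, ColumnStats] = field(default_factory=dict)


def can_drop_eq(rg: RowGroup, column: str, value: int) -> bool:
    """Return True only when the statistics prove no row can match column == value."""
    s = rg.stats.get(column)
    if s is None:
        return False  # no statistics: the row group must be kept
    # Assumption from the thread: row_count doubles as the value count
    # for every column, so an all-null column can be detected directly.
    if s.null_count == rg.row_count:
        return True
    if s.min_value is None or s.max_value is None:
        return False  # incomplete statistics: keep to be safe
    return value < s.min_value or value > s.max_value


def prune(row_groups: List[RowGroup], column: str, value: int) -> List[RowGroup]:
    """Keep only the row groups that might contain a matching row."""
    return [rg for rg in row_groups if not can_drop_eq(rg, column, value)]


groups = [
    RowGroup("a.parquet", 100, {"x": ColumnStats(1, 10, 0)}),
    RowGroup("b.parquet", 50, {"x": ColumnStats(None, None, 50)}),
    RowGroup("c.parquet", 10, {"x": ColumnStats(20, 30, 0)}),
]
kept = [rg.path for rg in prune(groups, "x", 5)]
none_left = prune(groups, "x", 100)
```

In this sketch, filtering on x = 100 leaves an empty list of row groups, which is the situation Adam describes: the scan then has nothing to read a schema from, so the caller needs some separate rule for producing an empty result.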
