That works for me.

On Mon, Dec 14, 2015 at 3:33 AM, Adam Gilmore <[email protected]> wrote:
> Shall we say 10am my time, 4pm your time?
>
> On Sunday, 13 December 2015, Julien Le Dem <[email protected]> wrote:
>> Tuesday morning in Australia, Monday afternoon in California sounds good to me.
>>
>> On Fri, Dec 11, 2015 at 11:42 AM, Parth Chandra <[email protected]> wrote:
>>> I'd like to attend as well. Any time that works for Julien/Jason works for me.
>>>
>>> On Thu, Dec 10, 2015 at 6:15 PM, Adam Gilmore <[email protected]> wrote:
>>>> Could we say Monday or Tuesday next week? I'm actually ahead of you guys by about 18 hours, so Monday morning my time would be Sunday afternoon/evening for you. If that doesn't work, what about Tuesday morning my time - Monday afternoon/evening your time?
>>>>
>>>> On Fri, Dec 11, 2015 at 1:30 AM, Jason Altekruse <[email protected]> wrote:
>>>>> I can also join for this meeting; Julien and I are both on SF time. It looks like you are about 5-6 hours behind us, so depending on whether you would prefer morning or afternoon, we'll just be a little further into our days.
>>>>>
>>>>> On Wed, Dec 9, 2015 at 7:16 PM, Adam Gilmore <[email protected]> wrote:
>>>>>> Sure - I'm in Australia, so I'm not sure how the timezones will work for you guys, but I'm pretty flexible. Where are you located?
>>>>>>
>>>>>> On Wed, Dec 9, 2015 at 5:48 AM, Julien Le Dem <[email protected]> wrote:
>>>>>>> Adam: do you want to schedule a hangout?
>>>>>>>
>>>>>>> On Tue, Dec 8, 2015 at 4:59 AM, Adam Gilmore <[email protected]> wrote:
>>>>>>>> That makes sense, yep. The problem, I guess, is with my implementation: I iterate through all Parquet files and try to eliminate the ones where the filter conflicts with the statistics. In instances where no files match the filter, I end up with an empty set of files for the Parquet scan to iterate through. I suppose I could just pick the schema of the first file or something, but that seems like a pretty messy rule.
>>>>>>>>
>>>>>>>> Julien - I'd be happy to have a chat about this. I've pretty much got the implementation down, but need to solve a few of these little issues.
>>>>>>>>
>>>>>>>> On Fri, Dec 4, 2015 at 5:22 AM, Hanifi GUNES <[email protected]> wrote:
>>>>>>>>> Regarding your point #1: I guess Daniel struggled with this limitation as well. I merged a few of his patches which addressed empty-batch (no data) handling in various places during execution. That said, we still have not had time to develop a solid way to handle empty batches with no schema.
>>>>>>>>>
>>>>>>>>> *- Scan batches don't allow empty batches. This means if a particular filter filters out *all* rows, we get an exception.*
>>>>>>>>> It looks to me like you are referring to no data rather than no schema here. I would expect graceful execution in this case. Do you mind sharing a simple reproduction?
>>>>>>>>>
>>>>>>>>> -Hanifi
>>>>>>>>>
>>>>>>>>> 2015-12-03 10:56 GMT-08:00 Julien Le Dem <[email protected]>:
>>>>>>>>>> Hey Adam,
>>>>>>>>>> If you have questions about the Parquet side of things, I'm happy to chat.
>>>>>>>>>> Julien
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 1, 2015 at 10:20 PM, Parth Chandra <[email protected]> wrote:
>>>>>>>>>>> Parquet metadata has the rowCount for every rowGroup, which is also the value count for every column in the rowGroup. Isn't that what you need?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Dec 1, 2015 at 10:10 PM, Adam Gilmore <[email protected]> wrote:
>>>>>>>>>>>> Hi guys,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm trying to (re)implement pushdown filtering for Parquet with the new Parquet metadata caching implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> I've run into a couple of challenges:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Scan batches don't allow empty batches. This means if a particular filter filters out *all* rows, we get an exception. I haven't read the full comments on the relevant JIRA items, but it seems odd that we can't query an empty JSON file, for example. This is a bit of a blocker to implementing the pushdown filtering properly.
>>>>>>>>>>>> 2. The Parquet metadata doesn't include all the relevant metadata. Specifically, the count of values is not included, so the default Parquet statistics filter has issues because it compares the count of values with the count of nulls to work out whether it can drop a block. This isn't necessarily a blocker, but it feels ugly simulating that there's "1" row in a block (just to get around the null comparison).
>>>>>>>>>>>>
>>>>>>>>>>>> Also, it feels a bit ugly rehydrating the standard Parquet metadata objects manually. I'm not sure I understand why we created our own objects for the Parquet metadata as opposed to simply writing a custom serializer for those objects which we store.
>>>>>>>>>>>>
>>>>>>>>>>>> Thoughts would be great - I'd love to get a patch out for this.
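
For context on the statistics-based pruning Adam describes above, here is a minimal, hypothetical Java sketch, not Drill's actual code: it reads each file's footer with parquet-mr and drops files whose row-group statistics show that a simple integer equality filter cannot match. The class name, the restriction to int columns, and the use of the footer API rather than Drill's metadata cache are all illustrative assumptions; the null-count check is the value-count vs. null-count comparison mentioned in point #2.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.statistics.IntStatistics;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class ParquetPruningSketch {

  // Keep only the files where at least one row group could still contain
  // rows with columnName == value, judging from footer statistics alone.
  public static List<Path> pruneFiles(Configuration conf, List<Path> files,
                                      String columnName, int value) throws IOException {
    List<Path> kept = new ArrayList<>();
    for (Path file : files) {
      ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
      for (BlockMetaData rowGroup : footer.getBlocks()) {
        if (!canDrop(rowGroup, columnName, value)) {
          kept.add(file);
          break;  // one possibly matching row group is enough to keep the file
        }
      }
    }
    return kept;
  }

  // True only if the statistics prove the row group holds no row with
  // columnName == value.
  private static boolean canDrop(BlockMetaData rowGroup, String columnName, int value) {
    for (ColumnChunkMetaData column : rowGroup.getColumns()) {
      if (!column.getPath().toDotString().equals(columnName)) {
        continue;
      }
      if (column.getStatistics() == null || column.getStatistics().isEmpty()) {
        return false;  // no statistics recorded: cannot prove anything, keep it
      }
      // The comparison from point #2: if every value in the column chunk is
      // null, an equality filter can never match.
      if (column.getStatistics().getNumNulls() == column.getValueCount()) {
        return true;
      }
      if (column.getStatistics() instanceof IntStatistics) {
        IntStatistics stats = (IntStatistics) column.getStatistics();
        return value < stats.getMin() || value > stats.getMax();
      }
      return false;  // type not handled by this sketch: keep the file
    }
    return false;  // column not present in this file: keep it, schemas may vary
  }
}

The sketch defaults to keeping a file whenever it cannot prove anything, so pruning only ever removes provably irrelevant data; the case Adam hit is exactly when pruneFiles returns an empty list and the scan is left with no files to read.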
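
In the same hedged spirit, a short sketch of Parth's point that BlockMetaData.getRowCount() is also the value count of every column in that row group, and of letting parquet-mr's own filter2 machinery decide which row groups to drop rather than re-implementing the comparison. The column name "id" and the value 7 are made-up examples, and the RowGroupFilter/FilterApi signatures are assumed from the parquet-mr releases of that era.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.compat.RowGroupFilter;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.schema.MessageType;

public class RowGroupFilterSketch {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path file = new Path(args[0]);

    ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
    MessageType schema = footer.getFileMetaData().getSchema();

    // The per-row-group row count doubles as the value count of every column
    // in that row group, which is what the statistics filter needs for its
    // null-count comparison.
    for (BlockMetaData rowGroup : footer.getBlocks()) {
      System.out.println("row group rows = " + rowGroup.getRowCount());
    }

    // Let parquet-mr's statistics filter decide which row groups can be
    // dropped for a filter like id = 7.
    FilterPredicate predicate = FilterApi.eq(FilterApi.intColumn("id"), 7);
    List<BlockMetaData> surviving =
        RowGroupFilter.filterRowGroups(FilterCompat.get(predicate), footer.getBlocks(), schema);
    System.out.println("surviving row groups = " + surviving.size());
  }
}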
