That makes sense, yep. The problem, I guess, is with my implementation. I iterate through all the Parquet files and try to eliminate the ones where the filter conflicts with the statistics. When no files match the filter, I end up with an empty set of files for the Parquet scan to iterate through. I suppose I could just pick the schema of the first file or something, but that seems like a pretty messy rule.
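
To make that concrete, here's a rough sketch of the pruning rule I have in
mind (hypothetical helper names, not Drill's actual classes): keep every file
whose statistics could still match the filter, and if that leaves nothing,
fall back to a single file purely so the scan has a schema to work with.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.function.Predicate;

    class PruningSketch {
      // files: whatever per-file metadata is available; statsCouldMatch is a
      // stand-in for the statistics check derived from the pushed-down filter.
      static <T> List<T> selectFiles(List<T> files, Predicate<T> statsCouldMatch) {
        List<T> selected = new ArrayList<>();
        for (T file : files) {
          if (statsCouldMatch.test(file)) {
            selected.add(file);
          }
        }
        if (selected.isEmpty() && !files.isEmpty()) {
          // Everything was pruned away: keep one file so the scan can still
          // produce a schema, even though it will return zero rows.
          return Collections.singletonList(files.get(0));
        }
        return selected;
      }
    }

That fallback is exactly the "pick the first file" rule above - it works, but
it's the part that feels messy.
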
Julien - I'd be happy to have a chat about this. I've pretty much got the
implementation down, but need to solve a few of these little issues.

On Fri, Dec 4, 2015 at 5:22 AM, Hanifi GUNES <[email protected]> wrote:

> Regarding your point #1. I guess Daniel struggled with this limitation as
> well. I merged few of his patches which addressed empty batch(no data)
> handling in various places during execution. That said, however, we still
> could not have time to develop a solid way to handle empty batches with no
> schema.
>
> *- Scan batches don't allow empty batches. This means if a
> particular filter filters out *all* rows, we get an exception.*
> Looks to me, you are referring to no data rather than no schema here. I
> would expect graceful execution in this case. Do you mind sharing a simple
> reproduction?
>
> -Hanifi
>
> 2015-12-03 10:56 GMT-08:00 Julien Le Dem <[email protected]>:
>
> > Hey Adam,
> > If you have questions about the Parquet side of things, I'm happy to
> > chat.
> > Julien
> >
> > On Tue, Dec 1, 2015 at 10:20 PM, Parth Chandra <[email protected]>
> > wrote:
> >
> > > Parquet metadata has the rowCount for every rowGroup which is also the
> > > value count for every column in the rowGroup. Isn't that what you need?
> > >
> > > On Tue, Dec 1, 2015 at 10:10 PM, Adam Gilmore <[email protected]>
> > > wrote:
> > >
> > > > Hi guys,
> > > >
> > > > I'm trying to (re)implement pushdown filtering for Parquet with the
> > > > new Parquet metadata caching implementation.
> > > >
> > > > I've run into a couple of challenges:
> > > >
> > > >    1. Scan batches don't allow empty batches. This means if a
> > > >    particular filter filters out *all* rows, we get an exception. I
> > > >    haven't read the full comments on the relevant JIRA items, but it
> > > >    seems odd that we can't query an empty JSON file, for example.
> > > >    This is a bit of a blocker to implement the pushdown filtering
> > > >    properly.
> > > >    2. The Parquet metadata doesn't include all the relevant metadata.
> > > >    Specifically, count of values is not included, therefore the
> > > >    default Parquet statistics filter has issues because it compares
> > > >    the count of values with count of nulls to work out if it can
> > > >    drop it. This isn't necessarily a blocker, but it feels ugly
> > > >    simulating there's "1" row in a block (just to get around the
> > > >    null comparison).
> > > >
> > > > Also, it feels a bit ugly rehydrating the standard Parquet metadata
> > > > objects manually. I'm not sure I understand why we created our own
> > > > objects for the Parquet metadata as opposed to simply writing a
> > > > custom serializer for those objects which we store.
> > > >
> > > > Thoughts would be great - I'd love to get a patch out for this.
> > > >
> >
> > --
> > Julien
> >
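
As an aside on point #2 of my original mail quoted above: the drop decision
in the default statistics filter effectively boils down to a comparison like
the following (illustrative only, not the exact parquet-mr code), which is
why the missing value count bites.

    class NullCountSketch {
      // A row group can only be dropped for a non-null predicate when every
      // value in the column is null, i.e. numNulls == valueCount.
      static boolean isAllNulls(long numNulls, long valueCount) {
        return numNulls == valueCount;
      }

      public static void main(String[] args) {
        // With valueCount missing (defaulting to 0), an ordinary column looks
        // "all nulls" too - which is why faking a "1" row count sidesteps
        // the comparison.
        System.out.println(isAllNulls(0, 0));  // prints true
      }
    }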
