Hi Adam,

I will update DRILL-2287 with some comments, since it has more context than this discussion thread, and we can continue the discussion there. The problem of invalid 0-length Parquet files being read sounds like a separate issue.
Aman

On Sun, Mar 22, 2015 at 6:48 PM, Adam Gilmore <[email protected]> wrote:

> Hi guys,
>
> I'm trying to work on an issue I've raised with partition pruning:
>
> https://issues.apache.org/jira/browse/DRILL-2287
>
> Basically, because the partition pruning is done after the
> DrillPushProjIntoScan, it seems like we can't detect that dir0 (for
> example) is not actually needed to be projected if it's not in the SELECT
> clause (or GROUP BY etc.).
>
> Moreover, I've come up with an issue whereby if I have, for example, 3
> directories - 2 with valid Parquet files and 1 with an invalid 0-byte
> Parquet file, even if we're partition pruning to only the valid
> directories, the query will fail (because it's trying to read the footer of
> the invalid Parquet file).
>
> It really feels like the partition pruning should be done before the
> DrillPushProjIntoScan.
>
> I know Jacques has just done some work on moving the partition pruning, so
> I thought I'd open the discussion here first before making too many
> in-roads into it.
>
> I do feel if we're partition pruning, we shouldn't even try to read any of
> those other directories during the planning stage. Furthermore, it doesn't
> make sense to prune the files being scanned but still keep a Filter
> operation in the query plan and project dir0 throughout it if it's not
> needed. The latter is why the queries end up being a lot slower.
>
> Thoughts?
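
To make the ordering concern in the quoted message concrete, here is a minimal toy sketch in Python. It is not Drill code: the file layout, `read_footer`, and both planning functions are hypothetical names invented for illustration. It only models the reported symptom, namely that reading every footer before pruning fails on the 0-byte file, while pruning by directory first never touches it.

```python
# Toy model only: every name here is hypothetical and none of it is a Drill API.

# Three directories (the dir0 values), one of which holds an invalid 0-byte file.
FILES = {
    "2015/a.parquet": 1024,
    "2016/b.parquet": 2048,
    "2017/c.parquet": 0,
}


def read_footer(path, size):
    """Stand-in for reading Parquet footer metadata; fails on an empty file."""
    if size == 0:
        raise ValueError(f"cannot read footer of 0-byte file: {path}")
    return {"path": path, "size": size}


def dir0(path):
    """Return the first directory component, playing the role of dir0."""
    return path.split("/", 1)[0]


def plan_footers_then_prune(files, keep_dirs):
    """Read every footer first, then drop the directories the filter excludes."""
    footers = [read_footer(p, s) for p, s in files.items()]
    return [f for f in footers if dir0(f["path"]) in keep_dirs]


def plan_prune_then_footers(files, keep_dirs):
    """Prune by directory first, then read footers only for surviving files."""
    selected = {p: s for p, s in files.items() if dir0(p) in keep_dirs}
    return [read_footer(p, s) for p, s in selected.items()]
```

With a filter that keeps only the 2015 and 2016 directories, `plan_prune_then_footers` returns the two valid files, whereas `plan_footers_then_prune` raises on the 0-byte file even though that directory would have been pruned anyway.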
