[GitHub] [arrow-datafusion] thinkharderdev commented on issue #3360: Support `RowFilter` in `ParquetExec`

GitBox Mon, 05 Sep 2022 06:04:15 -0700


thinkharderdev commented on issue #3360:
URL: 
https://github.com/apache/arrow-datafusion/issues/3360#issuecomment-1236992633


   > if we filter zero page, it will run slower than before.
   
   This isn't necessarily the case. Even if we don't prune any pages, skipping 
decoding can still give a pretty significant performance boost. 
   
   The general problem with selectivity is that we don't have much to go on at 
the time we need to build the filters. We have the Parquet metadata, but that 
isn't much :). I think the approach I'll go with for the draft PR is something 
like:
   
   1. Break apart all conjunctions.
   2. Consider "simple predicates" (binary expressions, is/not null, is 
true/false, etc).
   3. Apply filters on sorted columns first to potentially leverage the page 
index.
   4. After that, just use total size as the ordering (e.g. expressions which 
need to read less data go first).
   
   From there we can tweak it to include fancier heuristics (null counts, etc.).
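
A rough sketch of steps 1, 3 and 4, to make the ordering concrete. Everything here is hypothetical: `Expr`, `Candidate`, `split_conjunction` and `order_candidates` are simplified stand-ins invented for illustration, not DataFusion or arrow-rs types, and the "sorted" flag and byte sizes are assumed to have been read from the Parquet metadata already. Step 2 (keeping only "simple predicates") is not modeled.

```rust
// Hypothetical, simplified stand-ins; these are NOT DataFusion's types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// A leaf predicate, identified here only by a display string.
    Leaf(String),
    /// Logical AND of two sub-expressions.
    And(Box<Expr>, Box<Expr>),
}

/// Step 1: flatten nested ANDs into a list of independent predicates,
/// preserving their left-to-right order.
fn split_conjunction(expr: Expr, out: &mut Vec<Expr>) {
    match expr {
        Expr::And(left, right) => {
            split_conjunction(*left, out);
            split_conjunction(*right, out);
        }
        other => out.push(other),
    }
}

/// A predicate plus the facts the ordering heuristic needs.
#[derive(Debug, Clone)]
struct Candidate {
    name: String,
    /// Assumption: true if the predicate only reads sorted columns.
    on_sorted_column: bool,
    /// Assumption: total size of the columns the predicate must read.
    required_bytes: u64,
}

/// Steps 3 and 4: predicates on sorted columns go first, then the rest
/// in ascending order of how much data they need to read.
fn order_candidates(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    // `false` sorts before `true`, so negate the flag to put sorted first.
    candidates.sort_by_key(|c| (!c.on_sorted_column, c.required_bytes));
    candidates
}
```

With this shape, fancier heuristics such as null counts would only change the sort key in `order_candidates`.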


