thinkharderdev commented on issue #3360: URL: https://github.com/apache/arrow-datafusion/issues/3360#issuecomment-1236992633
> if we filter zero page, it will run slower than before.

This isn't necessarily the case. Even if we don't prune any pages, skipping decoding can still be a pretty significant performance boost.

The general problem with selectivity is that we really don't have much to go on at the time we need to build the filters. We have parquet metadata but that isn't much :). I think the approach I'll go with for the draft PR is something like:

1. Break apart all conjunctions.
2. Consider "simple predicates" (binary expressions, is/not null, is true/false, etc).
3. Apply filters on sorted columns first to potentially leverage the page index.
4. After that just use total size as the ordering (e.g. expressions which need to read less data go first).

From there we can tweak it to include fancier heuristics (null counts, etc).
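As a rough illustration of the four-step ordering described above, here is a minimal sketch in plain Rust. It does not use DataFusion's actual types: `Pred`, `Candidate`, `split_conjunction`, and `order_filters` are hypothetical names, and the `is_simple`, `on_sorted_column`, and `required_bytes` fields stand in for whatever could be derived from the expression and the parquet metadata. Leaving non-simple predicates out of the pushed-down list is also an assumption of the sketch.

```rust
/// Hypothetical, simplified predicate tree (DataFusion's real `Expr` is much richer).
#[derive(Debug, Clone, PartialEq)]
enum Pred {
    And(Box<Pred>, Box<Pred>),
    Leaf(Candidate),
}

/// Hypothetical per-predicate summary used only for ordering.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    name: &'static str,
    /// Step 2: binary expression, is/not null, is true/false, etc.
    is_simple: bool,
    /// Step 3: the predicate only touches sorted columns.
    on_sorted_column: bool,
    /// Step 4: total size of the columns the predicate has to read.
    required_bytes: u64,
}

/// Step 1: break a conjunction into its individual predicates.
fn split_conjunction(pred: Pred, out: &mut Vec<Candidate>) {
    match pred {
        Pred::And(left, right) => {
            split_conjunction(*left, out);
            split_conjunction(*right, out);
        }
        Pred::Leaf(candidate) => out.push(candidate),
    }
}

/// Steps 2-4: keep simple predicates, put sorted-column filters first,
/// then order the rest by how much data they need to read.
fn order_filters(pred: Pred) -> Vec<Candidate> {
    let mut parts = Vec::new();
    split_conjunction(pred, &mut parts);
    // Assumption: non-simple predicates are just not pushed down here.
    parts.retain(|c| c.is_simple);
    // `false < true`, so negating the flag sorts sorted-column filters first.
    parts.sort_by_key(|c| (!c.on_sorted_column, c.required_bytes));
    parts
}
```

With this shape, fancier heuristics such as null counts would only need to extend the sort key; the splitting and filtering steps stay the same.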
