alamb opened a new pull request, #8397: URL: https://github.com/apache/arrow-datafusion/pull/8397
## Which issue does this PR close?

POC for https://github.com/apache/arrow-datafusion/issues/8376

## Rationale for this change

See https://github.com/apache/arrow-datafusion/issues/8376

TLDR: I would like to use `PruningPredicate` for "bloom filter like" operations downstream in IOx, so I would like to refactor the code that applies Bloom Filters today into `PruningPredicate`.

## What changes are included in this PR?

1. I made good progress implementing the `contains` API in `PruningPredicate`
2. I also prototyped hacking it into the row group pruning code

## TODO

I discovered two interesting cases that I need to make work to fully port the pruning predicate code:

1. IN LISTS (aka `col IN (1,2,3)`)
2. Predicates like `col = 'foo' OR col2 = 'bar'` (aka disjunctions), which the current bloom filter code handles

## Are these changes tested?

- [ ] New tests for `PruningPredicate` contains (=, !=, = with OR, `IN` and `NOT IN`)
- [ ] Existing tests pass (IN PROGRESS)

## Are there any user-facing changes?

`PruningPredicate` has a new API

## Plans for merging

I do not plan to propose merging this PR as is. Instead I plan a series of PRs:

1. A PR that introduces the `contains` API with tests
2. A second PR that refactors the row group pruning code to use it
3. A third PR that refactors the RowGroupPruning code so it runs on more than one RowGroupMetadata at a time.

🤔 musing: At the moment, row group pruning happens in two phases: first the min/max statistics are applied, and then the bloom filters are applied, which means the bloom filters are potentially only fetched for groups that passed the first predicate.
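
To make the two-phase flow from the musing concrete, here is a minimal, self-contained sketch. It is not the DataFusion or parquet API: `RowGroup`, its fields, and `prune_row_groups` are hypothetical names invented for illustration, and an exact `HashSet` stands in for a bloom filter (a real bloom filter can return false positives, this stand-in cannot). The sketch treats the predicate as `col IN (candidates)`, with `col = x` as the one-element case, and it does not attempt the multi-column disjunction case (`col = 'foo' OR col2 = 'bar'`) from the TODO list.

```rust
use std::collections::HashSet;

/// Hypothetical per-row-group metadata (not DataFusion's or parquet's real types).
pub struct RowGroup {
    pub min: i64,
    pub max: i64,
    /// Stand-in for a bloom filter. An exact set never gives false positives,
    /// whereas a real bloom filter can.
    pub filter: HashSet<i64>,
}

/// Prunes row groups for a predicate of the form `col IN (candidates)`.
/// Returns the indices of the row groups that still need to be scanned and
/// the number of filters that had to be consulted along the way.
pub fn prune_row_groups(groups: &[RowGroup], candidates: &[i64]) -> (Vec<usize>, usize) {
    let mut kept = Vec::new();
    let mut filters_consulted = 0;

    for (idx, group) in groups.iter().enumerate() {
        // Phase 1: min/max statistics. If no candidate falls inside
        // [min, max], the group cannot match and its filter is never touched.
        let in_range = candidates
            .iter()
            .any(|v| group.min <= *v && *v <= group.max);
        if !in_range {
            continue;
        }

        // Phase 2: only groups that survived phase 1 pay for a filter lookup.
        filters_consulted += 1;
        if candidates.iter().any(|v| group.filter.contains(v)) {
            kept.push(idx);
        }
    }

    (kept, filters_consulted)
}
```

Given three row groups with ranges `[0, 10]`, `[20, 30]`, and `[0, 100]`, a lookup for `col = 5` consults only two of the three filters in this sketch, because the second group is already ruled out by its statistics.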
