alamb opened a new pull request, #8397:
URL: https://github.com/apache/arrow-datafusion/pull/8397

   ## Which issue does this PR close?
   POC for https://github.com/apache/arrow-datafusion/issues/8376
   
   
   ## Rationale for this change
   
   See https://github.com/apache/arrow-datafusion/issues/8376
   
   TL;DR: I would like to use `PruningPredicate` for "bloom filter like" 
operations downstream in IOx, so I would like to refactor the code that 
applies Bloom filters today into `PruningPredicate`.
   
   
   
   
   
   
   ## What changes are included in this PR?
   
   In this PR:
   1. I made good progress implementing the `contains` API in 
`PruningPredicate`
   2. I also prototyped hacking it into the row group pruning code
   
   ## TODO
   
   I discovered two interesting cases that I need to make work to fully port 
the pruning predicate code:
   1. IN LISTS (aka `col IN (1,2,3)`)
   2. predicates like `col = 'foo' OR col2 = 'bar'` (aka disjunctions), which 
the current bloom filter code handles
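
To make the two cases above concrete, here is a minimal sketch of how `IN` lists and disjunctions could be evaluated against a "contains"-style source. Everything in it (`Pred`, `Contains`, `might_match`, `ExactSet`) is a hypothetical stand-in written for illustration and is not the API in this PR or in DataFusion. The assumption is that the source can answer "definitely absent" (`Some(false)`) or "don't know" (`None`), and that a container may be pruned only when no branch can possibly match.

```rust
use std::collections::HashSet;

/// Hypothetical predicate shape for illustration; not DataFusion's `Expr`.
#[derive(Debug, Clone)]
enum Pred {
    /// `col = value`
    Eq(String, String),
    /// `col IN (v1, v2, ...)`
    InList(String, Vec<String>),
    /// `left OR right`
    Or(Box<Pred>, Box<Pred>),
}

/// Hypothetical stand-in for a bloom-filter-like source.
/// `Some(false)` means the value is definitely absent,
/// `None` means nothing is known (the value might be present).
trait Contains {
    fn contains(&self, column: &str, value: &str) -> Option<bool>;
}

/// Returns true if the container might hold a matching row and must be kept.
/// Returns false only when every alternative is known to be absent.
fn might_match<C: Contains>(pred: &Pred, c: &C) -> bool {
    match pred {
        Pred::Eq(col, v) => c.contains(col, v) != Some(false),
        // An IN list behaves like an OR of equalities on one column.
        Pred::InList(col, vs) => vs.iter().any(|v| c.contains(col, v) != Some(false)),
        // A disjunction can be pruned only if both sides can be pruned.
        Pred::Or(l, r) => might_match(l, c) || might_match(r, c),
    }
}

/// Toy exact-membership source covering a single column.
/// A real bloom filter could also report false positives.
struct ExactSet {
    column: String,
    values: HashSet<String>,
}

impl ExactSet {
    fn new(column: &str, values: &[&str]) -> Self {
        ExactSet {
            column: column.to_string(),
            values: values.iter().map(|v| v.to_string()).collect(),
        }
    }
}

impl Contains for ExactSet {
    fn contains(&self, column: &str, value: &str) -> Option<bool> {
        if column == self.column {
            Some(self.values.contains(value))
        } else {
            None // no information about other columns
        }
    }
}
```

With a source that only knows about `col`, an `IN` list whose values are all absent can be pruned, while `col = 'zzz' OR col2 = 'bar'` cannot, because nothing is known about `col2`.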
   
   
   ## Are these changes tested?
   
   - [ ] New tests for PruningPredicate contains (=, !=, = with OR, `IN` and 
`NOT IN`)
   - [ ] Existing tests pass (IN PROGRESS)
   
   ## Are there any user-facing changes?
   `PruningPredicate` gains a new `contains` API.
   
   
   ## Plans for merging
   I do not plan to propose merging this PR as is. Instead I plan a series of 
PRs:
   
   1. A PR that introduces the `contains` API with tests
   2. A second PR that refactors the row group pruning code to use it
   3. A third PR that refactors the RowGroupPruning code so it runs on more 
than one `RowGroupMetaData` at a time.
   
   🤔 musing: At the moment, the row group pruning happens in two phases -- 
first the min/max statistics are applied, and then the bloom filters are 
applied, which means the bloom filters are potentially fetched only for row 
groups that passed the first phase.
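
The two-phase behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual row group pruning code: `RowGroup` and `prune_two_phase` are hypothetical names, the column is a plain `i64`, the predicate is a single `col = target` equality, and an exact list of values stands in for a bloom filter.

```rust
/// Hypothetical per-row-group metadata for one i64 column.
struct RowGroup {
    min: i64,
    max: i64,
    /// Stand-in for a bloom filter: the exact values in the group.
    values: Vec<i64>,
}

/// Prunes row groups for the predicate `col = target`.
/// Returns the indices of row groups to scan and the number of
/// "bloom filters" that had to be fetched along the way.
fn prune_two_phase(groups: &[RowGroup], target: i64) -> (Vec<usize>, usize) {
    let mut fetched = 0;
    let mut keep = Vec::new();
    for (i, g) in groups.iter().enumerate() {
        // Phase 1: min/max statistics rule out groups without any fetch.
        if target < g.min || target > g.max {
            continue;
        }
        // Phase 2: the filter is fetched only for groups that survived phase 1.
        fetched += 1;
        if g.values.contains(&target) {
            keep.push(i);
        }
    }
    (keep, fetched)
}
```

In this sketch a group excluded by its min/max range never counts toward `fetched`, which is the saving the two-phase ordering provides.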
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
