alamb opened a new issue, #8436:
URL: https://github.com/apache/arrow-datafusion/issues/8436

   ### Is your feature request related to a problem or challenge?
   
   BloomFilter support was added in 
https://github.com/apache/arrow-datafusion/pull/7821 by @hengfeiyang ❤️ 
   
   There is partial support for optimizing queries that have `IN` list 
predicates, as suggested by @Ted-Jiang: 
https://github.com/apache/arrow-datafusion/pull/7821#discussion_r1366398648 and 
tested via 
https://github.com/apache/arrow-datafusion/blob/0d7cab055cb39d6df751e070af5a0bf5444e3849/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L1056-L1084
   
   However, this only supports queries where there are three or fewer items in 
the IN list:
   
   ```
   SELECT * 
   FROM parquet_file 
   WHERE col IN ('foo', 'bar', 'baz')
   ```
   
   It only works for small numbers of constants because the current 
implementation only checks for predicates like `col = 'foo' OR col = 'bar'`. 
This works for `InList`s with a small number of items (`3` or fewer) because 
they are rewritten to `OR` chains by this code in the optimizer:
   
   
https://github.com/apache/arrow-datafusion/blob/0d7cab055cb39d6df751e070af5a0bf5444e3849/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L500-L549
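
As a rough illustration of that rewrite, here is a minimal, self-contained sketch. The `Expr` enum and the `inline_short_in_list` function are hypothetical stand-ins written for this example, not DataFusion's actual types or simplifier code; the threshold value of 3 is taken from the description above.

```rust
/// Toy expression type, for illustration only (not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    Eq(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
    InList { expr: Box<Expr>, list: Vec<Expr> },
}

/// Lists with at most this many items are inlined (value assumed from the issue text).
const THRESHOLD_INLINE_INLIST: usize = 3;

/// Rewrite `e IN (a, b, c)` into `e = a OR e = b OR e = c` for short lists.
/// Longer lists and all other expressions are returned unchanged.
fn inline_short_in_list(expr: Expr) -> Expr {
    match expr {
        Expr::InList { expr, list }
            if !list.is_empty() && list.len() <= THRESHOLD_INLINE_INLIST =>
        {
            list.into_iter()
                // One equality comparison per list item.
                .map(|item| Expr::Eq(expr.clone(), Box::new(item)))
                // Chain the comparisons together with OR, left to right.
                .reduce(|acc, eq| Expr::Or(Box::new(acc), Box::new(eq)))
                .expect("list is non-empty")
        }
        other => other,
    }
}
```

A list with four or more items passes through as an `InList`, which is the case the equality-only bloom filter check never sees.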
   
   Thus, the current bloom filter code will not work for queries with more 
than `THRESHOLD_INLINE_INLIST` constants in the `IN` list, such as:
   
   ```sql
   SELECT * 
   FROM parquet_file 
   WHERE col IN (
     'constant1',
     'constant2',
     ..,
     'constant99',
     'constant100'
   )
   ```
   
   ### Describe the solution you'd like
   
   I would like the bloom filter code to directly support `InListExpr` and thus 
also support `IN` / `NOT IN` queries with large numbers of constants.
   
   In terms of implementation, after XXX is merged, this should be a 
straightforward matter of:
   1. Adding support in the `LiteralGuarantee` code (TODO link)
   2. Adding a test for `LiteralGuarantee`
   3. Adding an integration test for Bloom filters in YYY
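
To make the intended pruning rule concrete, here is a hedged sketch of how a set of `IN` list constants could be checked against a row group's Bloom filter. `ToyBloomFilter` and `can_skip_row_group` are hypothetical names invented for this example; they do not reflect the Parquet Bloom filter format or any DataFusion API, and the sketch covers only `IN`, not `NOT IN`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter standing in for a row group's filter (illustrative only).
struct ToyBloomFilter {
    bits: Vec<bool>,
}

impl ToyBloomFilter {
    fn new(num_bits: usize) -> Self {
        Self { bits: vec![false; num_bits] }
    }

    /// Bit positions for `value`, derived from two seeded hashes.
    fn positions(&self, value: &str) -> [usize; 2] {
        [0u64, 1u64].map(|seed| {
            let mut hasher = DefaultHasher::new();
            seed.hash(&mut hasher);
            value.hash(&mut hasher);
            (hasher.finish() % self.bits.len() as u64) as usize
        })
    }

    fn insert(&mut self, value: &str) {
        for pos in self.positions(value) {
            self.bits[pos] = true;
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    fn might_contain(&self, value: &str) -> bool {
        self.positions(value).iter().all(|&pos| self.bits[pos])
    }
}

/// A row group can be skipped for `col IN (values)` only when the filter
/// rules out every value in the list.
fn can_skip_row_group(filter: &ToyBloomFilter, in_list: &[&str]) -> bool {
    in_list.iter().all(|v| !filter.might_contain(v))
}
```

The check is the same no matter how many constants the list holds, so it does not depend on the list first being rewritten into an `OR` chain.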
   
   
   
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   Found while I was working on 
https://github.com/apache/arrow-datafusion/issues/8376


