alamb opened a new issue, #8436: URL: https://github.com/apache/arrow-datafusion/issues/8436
### Is your feature request related to a problem or challenge?

BloomFilter support was added in https://github.com/apache/arrow-datafusion/pull/7821 by @hengfeiyang ❤️

There is partial support for optimizing queries that have `IN` list predicates, as suggested by @Ted-Jiang: https://github.com/apache/arrow-datafusion/pull/7821#discussion_r1366398648 and tested via https://github.com/apache/arrow-datafusion/blob/0d7cab055cb39d6df751e070af5a0bf5444e3849/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L1056-L1084

However, this only supports queries where there are three or fewer items in the `IN` list:

```sql
SELECT * FROM parquet_file WHERE col IN ('foo', 'bar', 'baz')
```

It only works for small numbers of constants because the current implementation only checks for predicates like `col = 'foo' OR col = 'bar'`. The reason this works for `InList`s is that lists with a small number of items (`3` or fewer) are rewritten to `OR` chains by this code in the optimizer: https://github.com/apache/arrow-datafusion/blob/0d7cab055cb39d6df751e070af5a0bf5444e3849/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L500-L549

Thus, the current bloom filter code will not work for queries with large numbers (more than `THRESHOLD_INLINE_INLIST`) of constants in the `IN` list, such as

```sql
SELECT * FROM parquet_file WHERE col IN (
  'constant1',
  'constant2',
  ...,
  'constant99',
  'constant100'
)
```

### Describe the solution you'd like

I would like the bloom filter code to directly support `InListExpr` and thus also support `IN` / `NOT IN` queries with large numbers of constants.

In terms of implementation, after XXX is merged, this should be a straightforward matter of:

1. Adding support in the `LiteralGuarantee` code (TODO link)
2. Adding a test for `LiteralGuarantee`
3. Adding an integration test for Bloom filters in YYY

### Describe alternatives you've considered

_No response_

### Additional context

Found while I was working on https://github.com/apache/arrow-datafusion/issues/8376
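
To make the requested behavior concrete, here is a minimal Rust sketch of pruning a row group directly from an `IN` list, without first rewriting it to an `OR` chain. It is an illustration under stated assumptions, not DataFusion's implementation: `ToyBloom` and `can_prune_in_list` are hypothetical names, and the toy filter merely stands in for the Bloom filter stored with a Parquet row group. The idea is that a row group can be skipped for `col IN (v1, ..., vn)` only when the filter reports every `vi` as definitely absent, regardless of how long the list is.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter (hypothetical; stands in for a row group's real filter).
struct ToyBloom {
    bits: Vec<bool>,
}

impl ToyBloom {
    fn new(num_bits: usize) -> Self {
        Self { bits: vec![false; num_bits] }
    }

    /// Three bit positions for `value`, derived from three seeded hashes.
    fn positions(&self, value: &str) -> [usize; 3] {
        let mut out = [0usize; 3];
        for (seed, slot) in out.iter_mut().enumerate() {
            let mut hasher = DefaultHasher::new();
            (seed as u64).hash(&mut hasher);
            value.hash(&mut hasher);
            *slot = (hasher.finish() % self.bits.len() as u64) as usize;
        }
        out
    }

    fn insert(&mut self, value: &str) {
        let positions = self.positions(value);
        for &p in positions.iter() {
            self.bits[p] = true;
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    fn might_contain(&self, value: &str) -> bool {
        self.positions(value).iter().all(|&p| self.bits[p])
    }
}

/// Returns `true` when a row group can be skipped for `col IN (values)`:
/// every constant in the list must be definitely absent from the filter.
/// An empty list matches no rows, so it is trivially prunable here.
fn can_prune_in_list(filter: &ToyBloom, values: &[&str]) -> bool {
    values.iter().all(|v| !filter.might_contain(v))
}
```

Calling `can_prune_in_list(&filter, &["constant1", "constant2"])` with a filter built from the row group's values returns `false` as soon as any listed constant might be present. Because the check iterates over the list itself, it does not depend on the list being short enough to be inlined as `OR` predicates.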
