alamb opened a new issue, #8376:
URL: https://github.com/apache/arrow-datafusion/issues/8376

   ### Is your feature request related to a problem or challenge?
   
   In IOx in certain cases we know that a "container" (parquet file, or set of 
record batches) has only a single value for some computed quantity (in our case 
`hash(column) % N` for some constant `N` (like 100).
   
   For this case, we want to be able to quickly determine, given an arbitrary 
predicate, if that container could not possible contain the value.
   
   So for example, if we know the container has `hash(column) % 100 = 27`, 
given a predicate that includes an expression like `column = 'foo'`, we can 
compute the quantity `hash(column) % 100` and if it is not `27` we can skip the 
entire container.
   
   We could implement this directly in our codebase, and we may do so 
temporarily. However, I think the usecase is common enough that we would like 
to improve the support upstream in DataFusion so others can both benefit and 
help optimize for it.
   
   For example, applying BloomFilters to prune out parquet row groups (added in 
https://github.com/apache/arrow-datafusion/pull/7821 by @hengfeiyang) has the 
same pattern. 
   
   Since we also have other information such as min/max and null counts for 
certain columns that we prune using 
[`PruningPredicate`](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html),
 having this ability be part of `PruningPredicate` is compelling
   
   ### Describe the solution you'd like
   
   
   DataFusion's 
[`PruningPredicate`](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html)
  can already use information on ranges (min/max values). I would like to 
extend its capabulities to include taking advantage of aprior knowledge about 
certain specific values to take advantage of `column = <constant>` predicates.
   
   I propose we extend 
[PruningStatistics](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/trait.PruningStatistics.html)
 with some way to pass knowledge on about the contents of data structures like 
Bloom Filters. For example:
   
   ```rust
   pub trait PruningStatistics {
       // PROPOSED ADDITIONS
   
       // Returns an array where each element is
       // * `true` if the value of column CERTAINLY DOES contain `value`
       // * `false` if the value of column CERTAINLY DOES NOT contain `value`
       // * `null` if the value of column may or may not contain the value
       fn contains(&self, column: &Column, value: &ScalarValue) -> 
Result<Option<BooleanArray>>;
   
       // EXISTING METHODS
       fn min_values(&self, column: &Column) -> Option<ArrayRef>;
       fn max_values(&self, column: &Column) -> Option<ArrayRef>;
       fn num_containers(&self) -> usize;
       fn null_counts(&self, column: &Column) -> Option<ArrayRef>;
   }
   ```
   
   We could then implement the bloom filter pruning in DataFusion with this API 
as well as use the same thing for our downstream usecase
   
   ## Example for equality predicate `col = 'foo'`
   
   The PruningPredicate would call 
   ```
   PruningStatistics::contains(Column(col), 'foo')
   ``` 
   
   and could prune all containers that returned `false` 
   
   ## Example for inequality predicate `col != 'foo'`
   
   The PruningPredicate would call 
   ```
   PruningStatistics::contains(Column(col), 'foo')
   ``` 
   
   and could prune all containers that returned `true` 
   
   **Note** I don't think the `contains` API could be used for other inequality 
predicates like `col < 'foo'` for example. The existing min/max statistics 
would have to be used
   
   
   ### Describe alternatives you've considered
   
   I also thought about trying to rewrite equality predicates to take advantage 
of the existing min/max statistics (which can represent where a column has only 
a single value). However, that API doesn't allow for information like Bloom 
filters which simply can say for sure if the value may be present  or not.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to