alamb opened a new issue, #4512:
URL: https://github.com/apache/arrow-datafusion/issues/4512

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   Bloom filter support was added to arrow-rs in 28.0.0 (as part of 
https://github.com/apache/arrow-rs/issues/3023). Here is some of that 
background copy/pasted:
   
   There are usecases where one wants to search a large amount of parquet data 
for a relatively small number of rows. For example, if you have distributed 
tracing data stored as parquet files and want to find the data for a particular 
trace.
   
   In general, the pattern is "needle in a haystack type query" -- specifically 
a very selective predicate (passes on only a few rows) on high cardinality 
(many distinct values) columns. 
   
   Datafusion has fairly advanced support for 
   * row_group pruning
   * page index pruning
   * filter pushdown / late materialization
   
   These techniques are quite effective when data is sorted and large 
contiguous ranges of rows can be skipped.  However, doing needle in the 
haystack queries still often requires substantial amounts of CPU and IO 
   
   One challenge is that for typical high cardinality columns such as ids, they 
often (by design) span the entire range of values of the data type
   
   For example, given the best case when the data is "optimally sorted" by id 
within a row group,  min/max statistics can not help skip row groups or pages. 
Instead the entire column must be decoded to search for a particular value 
   
   ```
   ┌──────────────────────────┐                WHERE                 
   │            id            │       ┌─────── id = 54322342343      
   ├──────────────────────────┤       │                              
   │       00000000000        │       │                              
   ├──────────────────────────┤       │    Selective predicate on a  
   │       00054542543        │       │    high cardinality column   
   ├──────────────────────────┤       │                              
   │           ...            │       │                              
   ├──────────────────────────┤       │                              
   │        ??????????        │◀──────┘                              
   ├──────────────────────────┤          Can not rule out ranges     
   │           ...            │            using min/max values      
   ├──────────────────────────┤                                      
   │       99545435432        │                                      
   ├──────────────────────────┤                                      
   │       99999999999        │                                      
   └──────────────────────────┘                                      
                                                                     
     High cardinality column:                                        
       many distinct values                                          
             (sorted)                                                
                                                                     
   ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐                                           
      min: 00000000000                                               
   │  max: 99999999999   │                                           
                                                                     
   │       Metadata      │                                           
    ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─                                                       
                              
   ```
   
   The parquet file format has support for bloom filters: 
https://github.com/apache/parquet-format/blob/master/BloomFilter.md
   
   A bloom filter is a space efficient structure that allows determining if a 
value is not in a set quickly. So for a parquet file with bloom filters for 
`id` in the metadata, the entire row group can be skipped if the id is not 
present:
   
   
   ```
   ┌──────────────────────────┐                WHERE                
   │            id            │      ─ ─ ─ ─ ─ id = 54322342343     
   ├──────────────────────────┤     │                               
   │       00000000000        │           Can quickly check if      
   ├──────────────────────────┤     │    the value  54322342343     
   │       00054542543        │             is not present by       
   ├──────────────────────────┤     │     consulting the Bloom      
   │           ...            │                  Filter             
   ├──────────────────────────┤     │                               
   │        ??????????        │                                     
   ├──────────────────────────┤     │                               
   │           ...            │                                     
   ├──────────────────────────┤     │                               
   │       99545435432        │                                     
   ├──────────────────────────┤     │                               
   │       99999999999        │                                     
   └──────────────────────────┘     │                               
     High cardinality column:                                       
       many distinct values         │                               
             (sorted)                                               
                                    │                               
                                                                    
   ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─      │                               
                              │                                     
   │    bloom_filter: ....  ◀ ─ ─ ─ ┘                               
                              │                                     
   │  min: 00000000000                                              
      max: 99999999999        │                                     
   │                                                                
           Metadata           │                                     
   └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─                                      
   ```
   
   
   **Describe the solution you'd like**
   I would like the ParquetReader in DataFusion to take advantage of Bloom 
filters when they are present.
   
   This would be in addition to 
[`page_filter`](https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/file_format/parquet/page_filter.rs)
 and 
[row_filter](https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/file_format/parquet/row_filter.rs)
   
   Some high level steps are probably:
   - [ ] Add a config option like `OPT_PARQUET_PUSHDOWN_FILTERS`: 
https://github.com/apache/arrow-datafusion/blob/34d9bb5e64e01e1baca4f636c855082f4cadc270/datafusion/core/src/config.rs#L53
   - [ ]  Identify predicates that can be applied to bloom filters (e.g. `col = 
<constant>`)
   - [ ] Add a module that can read bloom filters and apply the predicates to 
rule out row groups (e.g. test for `<constant>` in the bloom filter for that 
column) in 
https://github.com/apache/arrow-datafusion/blob/34d9bb5e64e01e1baca4f636c855082f4cadc270/datafusion/core/src/physical_plan/file_format/parquet.rs#L481-L486
   - [ ] Add unit tests
   - [ ] Add basic integration tests in  
https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/parquet_exec.rs
   
   
   **Describe alternatives you've considered**
   Don't add support ?
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to