Re: [I] Bloom filter not properly leveraged when using an OR condition [iceberg]

via GitHub Mon, 25 Mar 2024 04:57:44 -0700


cccs-jc commented on issue #10029:
URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2017840320


   This weekend I fixed the issue with the three row-group filters not working 
together. The results are quite impressive: the `OR` query now takes about 11 
seconds instead of 396.
   ```
   -----------------results-4CPU-OR-fixed------------------
   ('R:128MB', 'SRC', 11.399899959564209)
   ('R:128MB', 'DST', 0.7222678661346436)
   ('R:128MB', 'AND', 0.6006526947021484)
   ('R:128MB', 'OR', 11.477725505828857)
   
   -----------------results-4CPU------------------
   ('R:128MB', 'SRC', 13.441139459609985)
   ('R:128MB', 'DST', 1.1408600807189941)
   ('R:128MB', 'AND', 0.9586172103881836)
   ('R:128MB', 'OR', 396.9800181388855)
   ```
   
   What I did is implement a new `ParquetCombinedRowGroupFilter`, which takes 
the `ParquetMetricsRowGroupFilter`, `ParquetDictionaryRowGroupFilter` and 
`ParquetBloomRowGroupFilter` and applies them like so:
   
   ```java
       @Override
       public <T> Boolean eq(BoundReference<T> ref, Literal<T> lit) {
         return visitors.stream().allMatch(v -> v.eq(ref, lit) == ROWS_MIGHT_MATCH);
       }
   ```
   
   For every column, it tests the metrics, dictionary and bloom filters in 
sequence. Only if all of them return ROWS_MIGHT_MATCH is shouldRead=True 
returned.
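   
   To illustrate why combining the filters per predicate helps with `OR`, here 
is a minimal, self-contained sketch. It is not the Iceberg code: `LeafFilter`, 
`METRICS`, `BLOOM`, `separate` and `combined` are hypothetical names, and the 
row-group statistics are made up. It assumes the previous behaviour was to let 
each filter evaluate the whole expression on its own and then AND the results.
   
   ```java
   import java.util.List;

   public class CombinedFilterSketch {
     static final boolean ROWS_MIGHT_MATCH = true;

     /** Hypothetical stand-in for one row-group filter answering "might column == value match?". */
     interface LeafFilter {
       boolean eq(String column, int value);
     }

     // Toy metrics filter: only knows that "src" values lie in [100, 200].
     static final LeafFilter METRICS = (column, value) ->
         column.equals("src") ? (value >= 100 && value <= 200) : ROWS_MIGHT_MATCH;

     // Toy bloom filter: only knows that "dst" contains 5 and 900.
     static final LeafFilter BLOOM = (column, value) ->
         column.equals("dst") ? (value == 5 || value == 900) : ROWS_MIGHT_MATCH;

     // Assumed old behaviour: each filter evaluates "src = ? OR dst = ?" alone,
     // and the row group is read if every filter says it might match.
     static boolean separate(List<LeafFilter> filters, int src, int dst) {
       return filters.stream().allMatch(f -> f.eq("src", src) || f.eq("dst", dst));
     }

     // Combined behaviour: each predicate must pass every filter before the OR is applied.
     static boolean combined(List<LeafFilter> filters, int src, int dst) {
       boolean srcMight = filters.stream().allMatch(f -> f.eq("src", src) == ROWS_MIGHT_MATCH);
       boolean dstMight = filters.stream().allMatch(f -> f.eq("dst", dst) == ROWS_MIGHT_MATCH);
       return srcMight || dstMight;
     }

     public static void main(String[] args) {
       List<LeafFilter> filters = List.of(METRICS, BLOOM);
       // Query: src = 7 OR dst = 42. Neither value is present in this row group.
       System.out.println("separate shouldRead: " + separate(filters, 7, 42));
       System.out.println("combined shouldRead: " + combined(filters, 7, 42));
     }
   }
   ```
   
   In this toy case the metrics filter can rule out `src = 7` but not 
`dst = 42`, and the bloom filter can rule out `dst = 42` but not `src = 7`, so 
neither filter alone can reject the `OR`. Only the combined evaluation skips 
the row group.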
   
   I still have to write a unit test to show that `OR` statements are now 
applied properly. I'll make a PR and we can compare notes.
   
   @zhongyujiang I'll have a look at your PR today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


