cccs-jc commented on issue #10029:
URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2017840320
This weekend I fixed the issue with the three row-group filters not working
together. The results are quite impressive: the `OR` query now takes about
11 seconds instead of 396.
```
-----------------results-4CPU-OR-fixed------------------
('R:128MB', 'SRC', 11.399899959564209)
('R:128MB', 'DST', 0.7222678661346436)
('R:128MB', 'AND', 0.6006526947021484)
('R:128MB', 'OR', 11.477725505828857)
-----------------results-4CPU------------------
('R:128MB', 'SRC', 13.441139459609985)
('R:128MB', 'DST', 1.1408600807189941)
('R:128MB', 'AND', 0.9586172103881836)
('R:128MB', 'OR', 396.9800181388855)
```
I implemented a new `ParquetCombinedRowGroupFilter` that takes the
`ParquetMetricsRowGroupFilter`, `ParquetDictionaryRowGroupFilter`, and
`ParquetBloomRowGroupFilter` and applies them like so:
```java
@Override
public <T> Boolean eq(BoundReference<T> ref, Literal<T> lit) {
  // The predicate might match only if every wrapped filter says it might match.
  return visitors.stream().allMatch(v -> v.eq(ref, lit) == ROWS_MIGHT_MATCH);
}
```
For every column it tests the metrics, dictionary, and bloom filters in
sequence. If all of them return `ROWS_MIGHT_MATCH`, then shouldRead=true is
returned; as soon as one filter rules the predicate out, the rest are skipped.
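To illustrate why combining the filters per predicate can prune more than running each filter over the whole expression, here is a minimal, self-contained sketch. It is a simplified model, not the Iceberg code: `CombinedFilterSketch`, `LeafFilter`, `Expr`, and the two toy filters are hypothetical names invented for this example, and the real cause of the slow `OR` case may differ in detail.

```java
import java.util.List;

// Hypothetical, simplified model; none of these types are Iceberg classes.
final class CombinedFilterSketch {
  static final boolean ROWS_MIGHT_MATCH = true;

  /** Stand-in for one row-group filter answering a single `column = value` predicate. */
  interface LeafFilter {
    boolean eq(String column, String value);
  }

  /** A tiny expression tree: an equality leaf or an OR of two children. */
  interface Expr {
    boolean eval(LeafFilter filter);
  }

  static Expr eq(String column, String value) {
    return f -> f.eq(column, value);
  }

  static Expr or(Expr left, Expr right) {
    return f -> left.eval(f) || right.eval(f);
  }

  /** Each filter evaluates the whole expression on its own; results are ANDed. */
  static boolean shouldReadIndependently(Expr expr, List<LeafFilter> filters) {
    return filters.stream().allMatch(expr::eval);
  }

  /** All filters are consulted at every leaf, mirroring the allMatch in eq(). */
  static boolean shouldReadCombined(Expr expr, List<LeafFilter> filters) {
    LeafFilter combined =
        (c, v) -> filters.stream().allMatch(f -> f.eq(c, v) == ROWS_MIGHT_MATCH);
    return expr.eval(combined);
  }

  public static void main(String[] args) {
    // Toy assumption: "metrics" can only rule out predicates on src,
    // and "bloom" can only rule out predicates on dst.
    LeafFilter metrics = (c, v) -> !c.equals("src");
    LeafFilter bloom = (c, v) -> !c.equals("dst");
    List<LeafFilter> filters = List.of(metrics, bloom);

    Expr expr = or(eq("src", "x"), eq("dst", "y"));

    System.out.println(shouldReadIndependently(expr, filters)); // prints "true"
    System.out.println(shouldReadCombined(expr, filters));      // prints "false"
  }
}
```

In this toy setup neither filter alone can reject the `OR`, because each one sees a branch it cannot rule out, so the row group is read. With the filters combined at each leaf, both branches are rejected and the row group is skipped.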
I still have to write a unit test to show that `OR` expressions are now
applied properly. I'll make a PR and we can compare notes.
@zhongyujiang I'll have a look at your PR today.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]