cccs-jc commented on issue #10029:
URL: https://github.com/apache/iceberg/issues/10029#issuecomment-2016525638
@huaxingao You are absolutely correct; the issue arises also when combining
the `statsFilter` with the `dictFilter`. It's essentially the same underlying
problem.
The crux of the issue lies in the evaluation of conditions such as `col1=1
OR col2=1`, where we're layering the complete evaluation performed by
`statsFilter`, with the evaluation from `dictFilter`, with the evaluation of
`bloomFilter`.
Ideally, for every column, we would want to combine the information from
stats, dict and bloom. This sounds like a "Chain of Responsibility."
So instead of the current implementation:
```java
statsFilter = new ParquetMetricsRowGroupFilter(expectedSchema, filter,
caseSensitive);
dictFilter = new ParquetDictionaryRowGroupFilter(expectedSchema, filter,
caseSensitive);
bloomFilter = new ParquetBloomRowGroupFilter(expectedSchema, filter,
caseSensitive);
boolean shouldRead =
filter == null
|| (statsFilter.shouldRead(typeWithIds, rowGroup)
&& dictFilter.shouldRead(
typeWithIds, rowGroup,
reader.getDictionaryReader(rowGroup))
&& bloomFilter.shouldRead(
typeWithIds, rowGroup,
reader.getBloomFilterDataReader(rowGroup)));
```
I propose the following:
```java
statsFilter = new ParquetMetricsRowGroupFilter(expectedSchema, filter,
caseSensitive);
dictFilter = new ParquetDictionaryRowGroupFilter(expectedSchema, filter,
caseSensitive);
bloomFilter = new ParquetBloomRowGroupFilter(expectedSchema, filter,
caseSensitive);
dictFilter.setDict(reader.getDictionaryReader(rowGroup));
bloomFilter.setBloom(reader.getBloomFilterDataReader(rowGroup));
// Chain statsFilter -> bloomFilter -> dictFilter
bloomFilter.setNext(dictFilter);
statsFilter.setNext(bloomFilter);
boolean shouldRead =
filter == null
|| statsFilter.shouldRead(typeWithIds, rowGroup);
```
In this setup, when the `statsFilter` evaluates an expression like `eq` and
determines that `ROWS_CANNOT_MATCH`, it returns immediately. However, if it
determines `ROWS_MIGHT_MATCH`, it passes on the request to the next filter
(`bloomFilter`). The `bloomFilter` then tests against the bloom information. If
it determines `ROWS_CANNOT_MATCH`, it returns immediately. But if it determines
`ROWS_MIGHT_MATCH`, it then asks the next filter (`dictFilter`).
This approach effectively combines all the information (metrics, bloom,
dict) for a given `col1` and accurately determines whether `ROWS_CANNOT_MATCH`
or `ROWS_MIGHT_MATCH` for `col1=1`.
Basically the chain is like this: `statsFilter.eq(col1=1) ->
bloomFilter.eq(col1=1) -> dictFilter.eq(col1=1)`
It does the same with `col2=1`.
Finally, the results for every column (`col1`, `col2`) can be combined using
the `public Boolean or(Boolean leftResult, Boolean rightResult)` method. The
`or` method does not need to be delegated since it does not tests against
metrics, dict or bloom.
Does this seem reasonable?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]