rdblue commented on a change in pull request #600: Implement in and notIn in
multiple visitors
URL: https://github.com/apache/incubator-iceberg/pull/600#discussion_r355150533
##########
File path:
parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java
##########
@@ -280,12 +281,33 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
@Override
public <T> Boolean in(BoundReference<T> ref, Set<T> literalSet) {
- return ROWS_MIGHT_MATCH;
+ int id = ref.fieldId();
+
+ Boolean hasNonDictPage = isFallback.get(id);
+ if (hasNonDictPage == null || hasNonDictPage) {
+ return ROWS_MIGHT_MATCH;
+ }
+
+ Set<T> dictionary = dict(id, ((BoundSetPredicate<T>) expr).comparator());
+
+ return Sets.intersection(dictionary, literalSet).isEmpty() ? ROWS_CANNOT_MATCH : ROWS_MIGHT_MATCH;
}
@Override
public <T> Boolean notIn(BoundReference<T> ref, Set<T> literalSet) {
- return ROWS_MIGHT_MATCH;
+ int id = ref.fieldId();
+
+ Boolean hasNonDictPage = isFallback.get(id);
+ if (hasNonDictPage == null || hasNonDictPage) {
+ return ROWS_MIGHT_MATCH;
+ }
+
+ Set<T> dictionary = dict(id, ((BoundSetPredicate<T>) expr).comparator());
+ if (dictionary.size() > 1 || mayContainNulls.get(id)) {
Review comment:
The `dictionary.size() > 1` check only helps for a `notEq` predicate, like
`!= "a"`. In that case, the filter is checking whether every value in the
column is `"a"`. If the dictionary has more than one entry, like
`dict = Set("a", "b")`, then at least one value in the column must be the other
dictionary entry, so not all values equal the excluded one and the row group
should be read.
For a set of values, the equivalent logic is to check whether the dictionary
size is greater than the number of literals in the `notIn` set. If so, then at
least one dictionary value cannot be in the set and the row group should be read.
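
To make the size comparison concrete, here is a minimal sketch of the check, not the code in this PR. The class name `NotInDictionaryCheck`, the plain `java.util.Set` arguments, and the boolean constants are hypothetical stand-ins for the filter's internals. The final `containsAll` step is my assumption about how the remaining case (dictionary no larger than the literal set) could be resolved. The early return for nulls mirrors the `mayContainNulls.get(id)` condition in the diff.

```java
import java.util.Set;

// Hypothetical stand-alone helper; the real filter works on BoundReference
// and a comparator-backed dictionary set rather than plain java.util.Set.
final class NotInDictionaryCheck {
  static final boolean ROWS_MIGHT_MATCH = true;
  static final boolean ROWS_CANNOT_MATCH = false;

  static <T> boolean notIn(Set<T> dictionary, Set<T> literalSet, boolean mayContainNulls) {
    // Mirrors the diff: a column that may contain nulls is always read.
    if (mayContainNulls) {
      return ROWS_MIGHT_MATCH;
    }

    // More dictionary entries than literals: at least one entry is
    // outside the notIn set, so some row can match.
    if (dictionary.size() > literalSet.size()) {
      return ROWS_MIGHT_MATCH;
    }

    // Assumed follow-up: only skip the row group when every dictionary
    // entry is excluded by the notIn set.
    return literalSet.containsAll(dictionary) ? ROWS_CANNOT_MATCH : ROWS_MIGHT_MATCH;
  }
}
```

With `dict = Set("a", "b")` and `notIn("a")`, the size check alone is enough to decide that the row group should be read. With `dict = Set("a", "d")` and `notIn("a", "b", "c")`, the size check is inconclusive and the containment test decides.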