rdblue commented on a change in pull request #600: Implement in and notIn in 
multiple visitors
URL: https://github.com/apache/incubator-iceberg/pull/600#discussion_r355150781
 
 

 ##########
 File path: 
parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java
 ##########
 @@ -280,12 +281,33 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 
     @Override
     public <T> Boolean in(BoundReference<T> ref, Set<T> literalSet) {
-      return ROWS_MIGHT_MATCH;
+      int id = ref.fieldId();
+
+      Boolean hasNonDictPage = isFallback.get(id);
+      if (hasNonDictPage == null || hasNonDictPage) {
+        return ROWS_MIGHT_MATCH;
+      }
+
+      Set<T> dictionary = dict(id, ((BoundSetPredicate<T>) expr).comparator());
+
+      return Sets.intersection(dictionary, literalSet).isEmpty() ? 
ROWS_CANNOT_MATCH : ROWS_MIGHT_MATCH;
     }
 
     @Override
     public <T> Boolean notIn(BoundReference<T> ref, Set<T> literalSet) {
-      return ROWS_MIGHT_MATCH;
+      int id = ref.fieldId();
+
+      Boolean hasNonDictPage = isFallback.get(id);
+      if (hasNonDictPage == null || hasNonDictPage) {
+        return ROWS_MIGHT_MATCH;
+      }
+
+      Set<T> dictionary = dict(id, ((BoundSetPredicate<T>) expr).comparator());
+      if (dictionary.size() > 1 || mayContainNulls.get(id)) {
+        return ROWS_MIGHT_MATCH;
+      }
+
+      return Sets.intersection(dictionary, literalSet).isEmpty() ? 
ROWS_MIGHT_MATCH : ROWS_CANNOT_MATCH;
 
 Review comment:
   If any value in the dictionary is not in the set, then a row has a value 
that matches the `notIn` predicate. This tests whether all values of the 
dictionary are not in the set (the intersection is empty). That's not correct. 
For example, if the dictionary is `["a", "b", "c"]` and the predicate is 
`notIn("a", "b", "d")`, this would return `ROWS_CANNOT_MATCH` because the 
intersection is `Set("a", "b")`. But, at least one row has value `"c"` that is 
not in the set, so rows do match.
   
   I think this should be a set difference operation. If 
`Sets.difference(dictionary, literalSet).isEmpty()` then there are no values in 
the dictionary that are not also in the set. All values are blacklisted so 
there are no values matching the predicate and this can return 
`ROWS_CANNOT_MATCH` in that case. Otherwise, rows might match.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to