rdblue commented on a change in pull request #600: Implement in and notIn in multiple visitors
URL: https://github.com/apache/incubator-iceberg/pull/600#discussion_r355150533
 
 

 ##########
 File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java
 ##########
 @@ -280,12 +281,33 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
 
     @Override
     public <T> Boolean in(BoundReference<T> ref, Set<T> literalSet) {
-      return ROWS_MIGHT_MATCH;
+      int id = ref.fieldId();
+
+      Boolean hasNonDictPage = isFallback.get(id);
+      if (hasNonDictPage == null || hasNonDictPage) {
+        return ROWS_MIGHT_MATCH;
+      }
+
+      Set<T> dictionary = dict(id, ((BoundSetPredicate<T>) expr).comparator());
+
+      return Sets.intersection(dictionary, literalSet).isEmpty() ? ROWS_CANNOT_MATCH : ROWS_MIGHT_MATCH;
     }
 
     @Override
     public <T> Boolean notIn(BoundReference<T> ref, Set<T> literalSet) {
-      return ROWS_MIGHT_MATCH;
+      int id = ref.fieldId();
+
+      Boolean hasNonDictPage = isFallback.get(id);
+      if (hasNonDictPage == null || hasNonDictPage) {
+        return ROWS_MIGHT_MATCH;
+      }
+
+      Set<T> dictionary = dict(id, ((BoundSetPredicate<T>) expr).comparator());
+      if (dictionary.size() > 1 || mayContainNulls.get(id)) {
 
 Review comment:
   The `dictionary.size() > 1` check only works for a `notEq` predicate, like 
`!= "a"`. There, it tests whether every value in the column is `"a"`. If the 
dictionary has more than one entry, like `dict = Set("a", "b")`, then at least 
one value in the column must be the other dictionary entry, so not all values 
equal the excluded one and the row group should be read.
   
   For a set of values, the equivalent logic is to check whether the dictionary 
size is greater than the number of literals in the `notIn` set. If so, then at 
least one dictionary value must not be in the set and the row group should be 
read.
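   
   A minimal sketch of that size comparison, assuming plain `java.util.Set` 
values in place of the comparator-backed dictionary. The class name, the 
`notIn` helper signature, and the final `containsAll` fallback are hypothetical 
and only illustrate the idea; they are not the code in this PR. The null 
handling mirrors the `mayContainNulls` condition in the diff above.

```java
import java.util.Set;

// Hypothetical stand-alone illustration; not part of ParquetDictionaryRowGroupFilter.
final class NotInDictionaryCheck {
  static final boolean ROWS_MIGHT_MATCH = true;
  static final boolean ROWS_CANNOT_MATCH = false;

  // dictionary: distinct non-null values in the column chunk (fully dictionary-encoded).
  // literalSet: values excluded by the notIn predicate.
  // mayContainNulls: whether the column chunk may hold nulls.
  static <T> boolean notIn(Set<T> dictionary, Set<T> literalSet, boolean mayContainNulls) {
    // More dictionary entries than literals means at least one entry is outside
    // the notIn set, so some row passes the predicate.
    if (dictionary.size() > literalSet.size() || mayContainNulls) {
      return ROWS_MIGHT_MATCH;
    }

    // Assumed fallback: rows can be skipped only if every dictionary entry is excluded.
    return literalSet.containsAll(dictionary) ? ROWS_CANNOT_MATCH : ROWS_MIGHT_MATCH;
  }
}
```

   With `dict = Set("a", "b")` and `notIn("a")`, the size check alone says the 
row group should be read. With `dict = Set("a")` and `notIn("a", "b")`, the 
size check does not fire, and the sketch falls through to the containment test.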

