yyanyy commented on a change in pull request #1747:
URL: https://github.com/apache/iceberg/pull/1747#discussion_r521837877



##########
File path: 
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
##########
@@ -150,6 +152,53 @@ public Boolean or(Boolean leftResult, Boolean rightResult) 
{
       return ROWS_MIGHT_MATCH;
     }
 
+    @Override
+    public <T> Boolean isNaN(BoundReference<T> ref) {
+      Integer id = ref.fieldId();
+
+      if (nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) 
== 0) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      // when there's no nanCounts information, but we already know the column 
only contains null,
+      // it's guaranteed that there's no NaN value
+      if (containsNullsOnly(id)) {

Review comment:
       I didn't define `notNaN` originally as I could directly return 
`ROWS_CANNOT_MATCH` when both `nanCounts` and `valueCounts` contain this column 
but numbers don't match, without going into the next block of logic (of 
checking upper == lower == NaN and null count == 0); but this advantage no 
longer exists since that block needs to be removed. 
   
   But I wasn't sure if we need it for `isNull`: currently in `isNull()` we are 
checking if `nullCounts == 0` to return `ROWS_CANNOT_MATCH`, and I guess the 
only chance where we rely on `containsNaNsOnly` to return `ROWS_CANNOT_MATCH` 
is `nullCounts` for this column doesn't exist but `nanCounts` does. I 
personally feel the chance of this happening would be small, do you think we 
will run into this case often?

##########
File path: 
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
##########
@@ -150,6 +152,53 @@ public Boolean or(Boolean leftResult, Boolean rightResult) 
{
       return ROWS_MIGHT_MATCH;
     }
 
+    @Override
+    public <T> Boolean isNaN(BoundReference<T> ref) {
+      Integer id = ref.fieldId();
+
+      if (nanCounts != null && nanCounts.containsKey(id) && nanCounts.get(id) 
== 0) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      // when there's no nanCounts information, but we already know the column 
only contains null,
+      // it's guaranteed that there's no NaN value
+      if (containsNullsOnly(id)) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      return ROWS_MIGHT_MATCH;
+    }
+
+    @Override
+    @SuppressWarnings("checkstyle:CyclomaticComplexity")
+    public <T> Boolean notNaN(BoundReference<T> ref) {
+      Integer id = ref.fieldId();
+
+      if (nanCounts != null && nanCounts.containsKey(id) &&
+          valueCounts != null && valueCounts.containsKey(id)) {
+        if (nanCounts.get(id).equals(valueCounts.get(id))) {
+          return ROWS_CANNOT_MATCH;
+        }
+
+        return ROWS_MIGHT_MATCH;
+      }
+
+      // for v1 table, when NaN could still be upper/lower bound,
+      // if upper == lower == NaN and null count == 0, the column will only 
contain NaN

Review comment:
       I totally forgot that, thanks for reminding this! Will remove this 
logic. 

##########
File path: api/src/main/java/org/apache/iceberg/expressions/Expressions.java
##########
@@ -123,6 +123,22 @@ public static Expression not(Expression child) {
     return new UnboundPredicate<>(Expression.Operation.NOT_NULL, expr);
   }
 
+  public static <T> UnboundPredicate<T> isNaN(String name) {
+    return new UnboundPredicate<>(Expression.Operation.IS_NAN, ref(name));
+  }
+
+  public static <T> UnboundPredicate<T> isNaN(UnboundTerm<T> expr) {
+    return new UnboundPredicate<>(Expression.Operation.IS_NAN, expr);
+  }
+
+  public static <T> UnboundPredicate<T> notNaN(String name) {
+    return new UnboundPredicate<>(Expression.Operation.NOT_NAN, ref(name));
+  }
+
+  public static <T> UnboundPredicate<T> notNaN(UnboundTerm<T> expr) {
+    return new UnboundPredicate<>(Expression.Operation.NOT_NAN, expr);
+  }

Review comment:
       I originally thought to update `SparkFilters` to do the rewrite, but 
this is a much better place. Thanks for the suggestion!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to