rdblue commented on a change in pull request #1747:
URL: https://github.com/apache/iceberg/pull/1747#discussion_r529813847



##########
File path: api/src/main/java/org/apache/iceberg/expressions/Expressions.java
##########
@@ -123,6 +123,22 @@ public static Expression not(Expression child) {
     return new UnboundPredicate<>(Expression.Operation.NOT_NULL, expr);
   }
 
+  public static <T> UnboundPredicate<T> isNaN(String name) {
+    return new UnboundPredicate<>(Expression.Operation.IS_NAN, ref(name));
+  }
+
+  public static <T> UnboundPredicate<T> isNaN(UnboundTerm<T> expr) {
+    return new UnboundPredicate<>(Expression.Operation.IS_NAN, expr);
+  }
+
+  public static <T> UnboundPredicate<T> notNaN(String name) {
+    return new UnboundPredicate<>(Expression.Operation.NOT_NAN, ref(name));
+  }
+
+  public static <T> UnboundPredicate<T> notNaN(UnboundTerm<T> expr) {
+    return new UnboundPredicate<>(Expression.Operation.NOT_NAN, expr);
+  }

Review comment:
       > If I understand correctly, to reject NaN in any predicate sounds like 
we might go back to the idea of rewriting equals in SparkFilters
   
   Yes. If the engine generally uses `d = NaN`, then we can convert that to 
`isNaN`. But that rewrite would be engine-dependent, and the Iceberg expression 
API itself would not support equals with NaN.
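   To illustrate the kind of engine-side rewrite described above, here is a 
minimal, self-contained sketch. The names (`rewriteEquals`, `Op`, `Predicate`) 
are hypothetical stand-ins, not the actual Iceberg or Spark APIs; the point is 
only that a `col = NaN` filter can be detected with `Double.isNaN` and turned 
into an `isNaN` predicate before it reaches the Iceberg expression API.

```java
// Hypothetical sketch of an engine-side rewrite of `col = NaN` to isNaN(col).
// Op, Predicate, and rewriteEquals are illustrative names, not real Iceberg API.
public class NaNRewriteSketch {
  enum Op { EQ, IS_NAN }

  record Predicate(Op op, String column, Double value) {}

  // Convert an equality filter into an IS_NAN predicate when the literal is NaN,
  // since Iceberg's Expressions would reject a NaN literal outright.
  static Predicate rewriteEquals(String column, double value) {
    if (Double.isNaN(value)) {
      return new Predicate(Op.IS_NAN, column, null);
    }
    return new Predicate(Op.EQ, column, value);
  }

  public static void main(String[] args) {
    System.out.println(rewriteEquals("d", Double.NaN).op()); // IS_NAN
    System.out.println(rewriteEquals("d", 1.5).op());        // EQ
  }
}
```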
   
   > are we able to rely on engine to do this check before translating query to 
Expression?
   
   I think so. Most engines will optimize the SQL expressions and handle this 
already. If not, then it would result in an exception from Iceberg to the user. 
I think that's okay, too, because as I said above, we want to fail if a NaN is 
used in an expression with a non-floating-point column, not rewrite to false.
   
   > And seems like this may only impact eq, as we decided to do input 
validation on the other lt/lteq/gt/gteq anyway?
   
   Yes. This makes all of the handling in `Expressions` consistent: always 
reject NaN values.
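   A rough sketch of that consistent rejection, assuming a hypothetical 
`checkNaN` helper (this is not the actual `Expressions` code, just an 
illustration of validating literals before building a predicate):

```java
// Illustrative NaN-rejection check for expression literals.
// checkNaN and its error message are assumptions, not the real Iceberg code.
public class NaNValidationSketch {
  static void checkNaN(Object value) {
    boolean isNaN =
        (value instanceof Double d && d.isNaN())
            || (value instanceof Float f && f.isNaN());
    if (isNaN) {
      throw new IllegalArgumentException(
          "Cannot use NaN in a comparison predicate; use isNaN or notNaN instead");
    }
  }

  public static void main(String[] args) {
    checkNaN(1.0); // fine: not NaN
    try {
      checkNaN(Double.NaN); // always rejected, regardless of operation
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```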
   
   > that may sound backward incompatible until the engine starts to rewrite 
NaN?
   
   I'm not convinced either way. You could argue that `d = NaN` is ambiguous 
and that rejecting it is fixing a bug. That's certainly the case with `d > 
NaN`, which is not defined. On the other hand, there was _some_ behavior 
before that will now no longer work. So I'd be up for fixing this in the Flink 
and Spark conversions as soon as we can.
   
   Feel free to ping me on Slack!



