[GitHub] [iceberg] rdblue commented on a change in pull request #2069: API: handle NaN as min/max stats in evaluators

GitBox Wed, 20 Jan 2021 09:01:33 -0800


rdblue commented on a change in pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#discussion_r559044025




##########
File path: api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
##########
@@ -204,15 +210,20 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     public <T> Boolean ltEq(BoundReference<T> ref, Literal<T> lit) {
       Integer id = ref.fieldId();
 
-      if (containsNullsOnly(id)) {
+      if (containsNullsOnly(id) || containsNaNsOnly(id)) {
         return ROWS_CANNOT_MATCH;
       }
 
       if (lowerBounds != null && lowerBounds.containsKey(id)) {
         T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
 
         int cmp = lit.comparator().compare(lower, lit.value());
-        if (cmp > 0) {
+
+        // Due to the comparison implementation of ORC stats, for float/double columns in ORC files,

Review comment:
   I don't think there is a need for an extra method that contains just one method call. I'd probably do it like this:
   
   ```java
           T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
           if (NaNUtil.isNaN(lower)) {
             // NaN indicates unreliable bounds. See the InclusiveMetricsEvaluator docs for more.
             return ROWS_MIGHT_MATCH;
           }
   
           int cmp = lit.comparator().compare(lower, lit.value());
           if (cmp > 0) {
             return ROWS_CANNOT_MATCH;
           }
   ```
   
   The docs would go in the javadoc for the whole class, and each NaN check 
could simply refer back to it. I also moved the NaN check above the comparison 
to keep the logic simple: if the value is NaN, the bound is invalid.
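
To illustrate why the NaN check has to sit above the comparison, here is a minimal standalone sketch. It is not the Iceberg code: the class `NaNBoundSketch`, its boolean constants, and the simplified `ltEq(Double, double)` signature are hypothetical stand-ins, and `Double.isNaN` is used in place of `NaNUtil.isNaN`. It assumes the comparator behaves like Java's natural `Double` ordering, which sorts NaN above every other value, so a NaN lower bound would otherwise look "greater than the literal" and wrongly prune the file.

```java
import java.util.Comparator;

public class NaNBoundSketch {
  // Hypothetical stand-ins for the evaluator's result constants.
  static final boolean ROWS_MIGHT_MATCH = true;
  static final boolean ROWS_CANNOT_MATCH = false;

  // Simplified "column <= literal" check against a file's lower bound.
  // A null lower bound means no bound was recorded (an assumption of this sketch).
  static boolean ltEq(Double lower, double literal) {
    if (lower == null) {
      return ROWS_MIGHT_MATCH;
    }

    if (Double.isNaN(lower)) {
      // NaN indicates an unreliable bound, so the file cannot be skipped.
      return ROWS_MIGHT_MATCH;
    }

    int cmp = Comparator.<Double>naturalOrder().compare(lower, literal);
    if (cmp > 0) {
      // Every value in the file is greater than the literal.
      return ROWS_CANNOT_MATCH;
    }

    return ROWS_MIGHT_MATCH;
  }

  public static void main(String[] args) {
    System.out.println(ltEq(10.0, 5.0));       // false: lower bound exceeds the literal
    System.out.println(ltEq(Double.NaN, 5.0)); // true: NaN bound is ignored
    // Without the NaN check, the comparison alone would prune the file:
    System.out.println(Double.compare(Double.NaN, 5.0) > 0); // true
  }
}
```

With the check placed first, the comparison only ever sees a usable bound, which is the "if the value is NaN, the bound is invalid" rule described above.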




