yyanyy commented on a change in pull request #1872:
URL: https://github.com/apache/iceberg/pull/1872#discussion_r570609722



##########
File path: 
api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -144,18 +143,37 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
     @Override
     public <T> Boolean isNaN(BoundReference<T> ref) {
       int pos = Accessors.toPosition(ref.accessor());
-      // containsNull encodes whether at least one partition value is null, lowerBound is null if
-      // all partition values are null.
-      if (stats.get(pos).containsNull() && stats.get(pos).lowerBound() == null) {
-        return ROWS_CANNOT_MATCH; // all values are null
+
+      if (stats.get(pos).containsNaN() != null && !stats.get(pos).containsNaN()) {
+        return ROWS_CANNOT_MATCH;
+      }
+
+      if (allValuesAreNull(stats.get(pos))) {
+        return ROWS_CANNOT_MATCH;
       }
 
       return ROWS_MIGHT_MATCH;
     }
 
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {
+      // Before introducing containsNaN field, containsNull encodes whether at least one partition value is null,
+      // lowerBound is null if all partition values are null.
+      // After introducing containsNaN field, containsNaN must be false to ensure all values are null since bounds
+      // don't include NaN anymore.
+      return summary.containsNull() && summary.lowerBound() == null &&
+          (summary.containsNaN() == null || !summary.containsNaN());

Review comment:
       I think excluding `NaN` from `lower`/`upper` and adding `containsNaN` 
both belong to this PR, so any release that contains this change will be in 
one of two states: (1) `NaN` is part of `lower`/`upper` and `containsNaN` is 
missing, or (2) `containsNaN` exists and `lower`/`upper` don't store `NaN`. 
But I guess people may implement their own manifest summary that already 
excludes `NaN` from the bounds but has no `containsNaN`, so we still want to 
handle that case. File-level metrics can give more granular information, so 
there isn't necessarily any performance penalty. I have updated this PR to 
check for the existence of `containsNaN`, but please let me know if my 
understanding isn't correct! 
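
       To make the cases above concrete, here is a minimal, standalone sketch 
of the conservative check. `Summary` is a hypothetical stand-in for 
`PartitionFieldSummary` (same three accessors, nothing else), and the class 
name `AllNullSketch` is made up for illustration; this is not the code in 
the PR, only the same boolean logic as the second hunk below.

```java
import java.nio.ByteBuffer;

public class AllNullSketch {
  // Hypothetical stand-in for PartitionFieldSummary.
  // containsNaN is a nullable Boolean: null means the writer did not record it.
  static final class Summary {
    private final boolean containsNull;
    private final Boolean containsNaN;
    private final ByteBuffer lowerBound;

    Summary(boolean containsNull, Boolean containsNaN, ByteBuffer lowerBound) {
      this.containsNull = containsNull;
      this.containsNaN = containsNaN;
      this.lowerBound = lowerBound;
    }

    boolean containsNull() { return containsNull; }
    Boolean containsNaN() { return containsNaN; }
    ByteBuffer lowerBound() { return lowerBound; }
  }

  // Only claim "all values are null" when containsNaN is present and false.
  // A missing containsNaN may mean the bounds silently exclude NaN, so a null
  // lower bound alone is not enough evidence.
  static boolean allValuesAreNull(Summary summary) {
    return summary.containsNull() && summary.lowerBound() == null
        && summary.containsNaN() != null && !summary.containsNaN();
  }

  public static void main(String[] args) {
    ByteBuffer someBound = ByteBuffer.allocate(4);

    // containsNaN recorded as false, no lower bound: every value is null.
    System.out.println(allValuesAreNull(new Summary(true, Boolean.FALSE, null))); // true
    // containsNaN missing: cannot rule out NaN-only values, stay conservative.
    System.out.println(allValuesAreNull(new Summary(true, null, null))); // false
    // containsNaN recorded as true: some values are NaN, not null.
    System.out.println(allValuesAreNull(new Summary(true, Boolean.TRUE, null))); // false
    // A lower bound exists: at least one non-null, non-NaN value.
    System.out.println(allValuesAreNull(new Summary(true, Boolean.FALSE, someBound))); // false
  }
}
```

       The `!= null` test comes before the negation so that a missing 
`containsNaN` is never unboxed. Returning `false` in that case only means the 
manifest is not pruned on this basis; it never drops rows that could match.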

##########
File path: 
api/src/main/java/org/apache/iceberg/expressions/ManifestEvaluator.java
##########
@@ -329,5 +338,12 @@ public Boolean or(Boolean leftResult, Boolean rightResult) {
 
       return ROWS_MIGHT_MATCH;
     }
+
+    private boolean allValuesAreNull(PartitionFieldSummary summary) {
+      // containsNull encodes whether at least one partition value is null, lowerBound is null if all partition values
+      // are null; in case bounds don't include NaN value, containsNaN needs to be checked against.
+      return summary.containsNull() && summary.lowerBound() == null &&
+          summary.containsNaN() != null && !summary.containsNaN();

Review comment:
       Yes, sorry I forgot to address this comment... Will update



