yyanyy commented on issue #2492: URL: https://github.com/apache/iceberg/issues/2492#issuecomment-822842065
Sorry to make sure I understand, are you suggesting if both upper/lower bounds are NaN, the column will only contain NaN so that we can return false? I think it's possible that a column may contain a mix of NaN and normal values but still report min/max as both NaN, which is what ORC files metric collection does today: https://github.com/apache/iceberg/blob/master/core/src/test/java/org/apache/iceberg/TestMetrics.java#L373-L374 But after thinking about it for a bit, I think it may not worth adding this logic to avoid including unnecessary files, as 1) the chance might be low if people to do their own implementation on it since implementing manifest list may be non-trivial, 2) it's not a new thing we introduced, and 3) we start to mention in spec that upper/lower shouldn't include NaN: https://iceberg.apache.org/spec/#snapshots: ``` 510 lower_bound ... Lower bound for the non-null, non-NaN values in the partition field, or null if all values are null or NaN [2] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
