yyanyy commented on pull request #1747: URL: https://github.com/apache/iceberg/pull/1747#issuecomment-724402711
(Not related to this change itself.) I was thinking about how we should change the metrics evaluators once we exclude NaN from upper/lower bounds. Here is a table summarizing the changes we would have to make:

| | strict evaluator | inclusive evaluator |
|--------------------|---------------------------|---------------------------|
| ... where id lteq/gteq NaN | need to check whether we are comparing with NaN, and check the NaN counter if min==max==null | need to check whether we are comparing with NaN, and check the NaN counter |
| ... where id = V, V != NaN, and column contains some NaN | if min==max==V, need to check the null/NaN counts to decide whether `ROWS_MUST_MATCH` should be returned | no change |
| ... where id lt/lteq/gt/gteq V, V != NaN, and column contains some NaN | if the NaN count is non-zero, return `ROWS_MIGHT_NOT_MATCH` (this may result in v2 returning more files than v1) | if the NaN count is non-zero, return `ROWS_MIGHT_MATCH` (this may result in v2 returning more files than v1) |
| ... where id lt/lteq/gt/gteq V, V != NaN, and column contains only NaN | no change | if the null/NaN count is non-zero, return `ROWS_MIGHT_MATCH` (this may result in v2 returning more files than v1) |

Here is an example of what "this may result in v2 returning more files than v1" means. Say that in v1 we consistently treat NaN as the lower bound whenever there is any NaN value, and a file has stats distributed as below:

```
NaN-------<actual min>----<actual max>-----V-----
```

In v1, without a NaN counter, the query `where x > V` will not return this file, since V lies outside the file's bounds. In v2, however, we will return it, since we don't know how to compare with NaN.

Another change is `in`. In v2 we may need to explicitly check whether `literalSet` contains a NaN value when comparing with the lower/upper bounds.

Do the statements above look right? Thanks!
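To make the `where x > V` case concrete, here is a minimal Java sketch of the two inclusive checks being compared. It is only an illustration under assumptions, not the actual evaluator code: the class `NaNBoundsSketch` and both method names are hypothetical, the file stats are reduced to an upper bound plus a NaN count, and the two boolean constants are stand-ins named after the results mentioned above.

```java
// Hypothetical sketch; not taken from the Iceberg code base.
final class NaNBoundsSketch {
  // Stand-ins for the evaluator results discussed in the comment.
  static final boolean ROWS_MIGHT_MATCH = true;
  static final boolean ROWS_CANNOT_MATCH = false;

  // v1-style inclusive check for "x > v": only the upper bound is consulted.
  // A null upper bound means no stats are available, so the file must be kept.
  static boolean inclusiveGtBoundsOnly(Double upper, double v) {
    if (upper == null) {
      return ROWS_MIGHT_MATCH;
    }
    return upper > v ? ROWS_MIGHT_MATCH : ROWS_CANNOT_MATCH;
  }

  // v2-style inclusive check as proposed in the table: any NaN in the column
  // makes the result conservative, because NaN is excluded from the bounds.
  static boolean inclusiveGtWithNaNCount(Double upper, long nanCount, double v) {
    if (nanCount > 0) {
      return ROWS_MIGHT_MATCH;
    }
    return inclusiveGtBoundsOnly(upper, v);
  }

  public static void main(String[] args) {
    double actualMax = 5.0; // <actual max> in the diagram
    double v = 10.0;        // V, to the right of <actual max>

    // v1: V is above the upper bound, so the file is skipped.
    System.out.println(inclusiveGtBoundsOnly(actualMax, v));       // false
    // v2: the file holds NaN values, so it is returned.
    System.out.println(inclusiveGtWithNaNCount(actualMax, 3L, v)); // true
  }
}
```

With a NaN count of zero the two checks agree, so the extra files in v2 come only from files that actually contain NaN.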
