[GitHub] [iceberg] yyanyy commented on pull request #1747: API: add isNaN and notNaN predicates

GitBox Mon, 09 Nov 2020 18:08:20 -0800


yyanyy commented on pull request #1747:
URL: https://github.com/apache/iceberg/pull/1747#issuecomment-724402711



   (Not related to this change itself.)
   I was thinking about how the metrics evaluators should change once we exclude NaN from upper/lower bounds. Here's a table summarizing the changes we would have to make:
   
   | | strict evaluator | inclusive evaluator |
   |---|---|---|
   | ... where id lteq/gteq NaN | need to check whether we are comparing with NaN, and check the NaN counter if min==max==null | need to check whether we are comparing with NaN, and check the NaN counter |
   | ... where id = V, V != NaN, and column contains some NaN | if min==max==V, need to check the null/NaN counts to decide whether `ROWS_MUST_MATCH` should be returned | no change |
   | ... where id lt/lteq/gt/gteq V, V != NaN, and column contains some NaN | if the NaN count is non-zero, return `ROWS_MIGHT_NOT_MATCH` (this may result in v2 returning more files than v1) | if the NaN count is non-zero, return `ROWS_MIGHT_MATCH` (this may result in v2 returning more files than v1) |
   | ... where id lt/lteq/gt/gteq V, V != NaN, and column contains only NaN | no change | if the null/NaN counts are non-zero, return `ROWS_MIGHT_MATCH` (this may result in v2 returning more files than v1) |
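
To make the `id = V` row concrete for the strict evaluator, here is a minimal sketch. It assumes the bounds no longer include NaN and that null and NaN counts are available per column. The class, method, and parameter names are hypothetical and are not Iceberg's actual evaluator API.

```java
// Hypothetical sketch: these names are illustrative, not Iceberg's actual API.
final class StrictEqSketch {
  static final boolean ROWS_MUST_MATCH = true;
  static final boolean ROWS_MIGHT_NOT_MATCH = false;

  // Strict evaluation of `id = v` for a double column, assuming lower/upper
  // exclude NaN and that nullCount/nanCount are tracked separately.
  static boolean eq(Double lower, Double upper, long nullCount, long nanCount, double v) {
    if (Double.isNaN(v)) {
      // Comparing with NaN is a separate case; stay conservative here.
      return ROWS_MIGHT_NOT_MATCH;
    }
    if (lower == null || upper == null) {
      // No usable bounds, so nothing can be proven.
      return ROWS_MIGHT_NOT_MATCH;
    }
    if (nullCount > 0 || nanCount > 0) {
      // Even if min == max == v, some rows are null or NaN and do not equal v.
      return ROWS_MIGHT_NOT_MATCH;
    }
    return (lower.doubleValue() == v && upper.doubleValue() == v)
        ? ROWS_MUST_MATCH
        : ROWS_MIGHT_NOT_MATCH;
  }
}
```

With these assumptions, a file whose bounds are both `V` only yields `ROWS_MUST_MATCH` when both counters are zero.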
   
   Here's an example of what "this may result in v2 returning more files than v1" means. Say that in v1 we consistently treat NaN as the lower bound whenever a file contains any NaN value, and a file has stats distributed as below:
   ```NaN-------<actual min>----<actual max>-----V-----```
   In v1, without a NaN counter, the query `where x > V` will not return this file, since V lies outside the bounds. In v2, however, we will return it, since we don't know how to compare with NaN.
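
The difference can be sketched with two inclusive checks for `x > V`. This is only an illustration of the scenario above: the method names and the exact v2 rule (treating any non-zero NaN count as "might match") are assumptions, not Iceberg's implementation.

```java
// Hypothetical sketch of the inclusive `x > v` check; not Iceberg's actual API.
final class InclusiveGtSketch {
  // v1-style: only the bounds are consulted. With NaN folded into the lower
  // bound, the upper bound is still the actual max, so the file is skipped
  // whenever upper <= v.
  static boolean gtV1(double lower, double upper, double v) {
    return upper > v;
  }

  // v2-style (assumed): bounds exclude NaN and a NaN counter is available.
  // If the file holds any NaN we cannot order it against v, so we keep the file.
  static boolean gtV2(Double lower, Double upper, long nanCount, double v) {
    if (nanCount > 0) {
      return true; // ROWS_MIGHT_MATCH
    }
    if (upper == null) {
      return true; // no bound available, stay conservative
    }
    return upper > v;
  }
}
```

For the file in the diagram, say with actual min 1.0, actual max 5.0, and V = 10.0, `gtV1(Double.NaN, 5.0, 10.0)` skips the file while `gtV2(1.0, 5.0, 1, 10.0)` keeps it.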
   
   Another change is `in`. In v2 we may need to explicitly check whether `literalSet` contains a NaN value when comparing against the lower/upper bounds.
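
One possible shape for that check in the inclusive evaluator is sketched below. The handling shown (matching a NaN literal against the NaN counter and skipping it for range comparisons) is an assumption about how this could work, and the class and parameter names are hypothetical.

```java
import java.util.Set;

// Hypothetical sketch of an inclusive `in` check; not Iceberg's actual API.
final class InclusiveInSketch {
  static boolean in(Double lower, Double upper, long nanCount, Set<Double> literalSet) {
    // Double.equals treats NaN as equal to NaN, so contains() finds it.
    if (literalSet.contains(Double.NaN) && nanCount > 0) {
      return true; // ROWS_MIGHT_MATCH: the file holds NaN and NaN is requested
    }
    if (lower == null || upper == null) {
      return true; // no bounds available, stay conservative
    }
    for (double literal : literalSet) {
      if (Double.isNaN(literal)) {
        continue; // NaN cannot be ordered against the bounds
      }
      if (literal >= lower && literal <= upper) {
        return true; // ROWS_MIGHT_MATCH
      }
    }
    return false; // ROWS_CANNOT_MATCH
  }
}
```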
   
   Do the statements above look right? Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

