aokolnychyi commented on pull request #3578: URL: https://github.com/apache/iceberg/pull/3578#issuecomment-974578983
@rdblue, okay, I missed that we don't follow the SQL semantics in Iceberg expressions. It was slightly surprising since we are offering SQL tables, but I definitely agree the null handling in SQL is confusing. I'll try to wrap my head around this.

To give more context, I started looking into this as the following metadata DELETE gave a wrong result. Suppose we have a single file with 2 records.

```
id
--
2
null
```

And someone issues a DELETE command.

```
DELETE FROM t WHERE id NOT IN (1, 10)
```

The expected outcome in SQL is to remove `id = 2` and keep the record with `null`. That's why it cannot be a metadata delete in Iceberg and we have to rewrite the file. However, that's not what happens now. We currently think that all records in the file match the condition and delete the entire file, meaning that we delete records we were not supposed to delete.

If Iceberg expressions handle nulls differently by design, then we have to fix `SparkFilters`. Do I understand correctly that a filter from Spark like `col NOT IN (1, 2)` has to be translated into `notNull(col) && notIn(col, 1, 2)` in Iceberg?
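To make the difference concrete, here is a minimal Python sketch of the example above. It is not Iceberg or Spark code: the helper names `sql_not_in`, `two_valued_not_in`, and `translated_not_in` are hypothetical, and the two-valued behavior (a null counts as "not in the set") is an assumption based on the behavior described in this comment. It also assumes a non-empty literal list that contains no nulls.

```python
def sql_not_in(value, literals):
    """SQL three-valued NOT IN: returns True, False, or None (UNKNOWN)."""
    if value is None:
        # NULL NOT IN (non-empty list) evaluates to UNKNOWN, so the row is not selected.
        return None
    return value not in literals


def two_valued_not_in(value, literals):
    """Assumed two-valued notIn: a null is simply 'not in the set', so it matches."""
    return value not in literals


def translated_not_in(value, literals):
    """The proposed translation: notNull(col) && notIn(col, ...)."""
    return value is not None and two_valued_not_in(value, literals)


rows = [2, None]      # the single file with two records
literals = (1, 10)    # DELETE FROM t WHERE id NOT IN (1, 10)

# A WHERE clause only selects rows for which the predicate is exactly TRUE.
sql_deleted = [r for r in rows if sql_not_in(r, literals) is True]
naive_deleted = [r for r in rows if two_valued_not_in(r, literals)]
translated_deleted = [r for r in rows if translated_not_in(r, literals)]

# Dropping the whole file (a metadata delete) is only safe if every row matches.
naive_drops_file = len(naive_deleted) == len(rows)
translated_drops_file = len(translated_deleted) == len(rows)

print(sql_deleted, naive_deleted, translated_deleted)
print(naive_drops_file, translated_drops_file)
```

Under these assumptions, the two-valued predicate alone matches both rows and would allow dropping the whole file, while the translated predicate selects the same rows as SQL and forces a file rewrite.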
