aokolnychyi commented on pull request #3578:
URL: https://github.com/apache/iceberg/pull/3578#issuecomment-974578983


   @rdblue, okay, I missed that Iceberg expressions don't follow SQL semantics. 
That was slightly surprising since we offer SQL tables, but I definitely agree 
that null handling in SQL is confusing. I'll try to wrap my head around this.
   
   To give more context, I started looking into this as the following metadata 
DELETE gave a wrong result.
   
   Suppose we have a single file with 2 records.
   
   ```
   id
   --
   2
   null
   ```
   
   And someone issues a DELETE command.
   
   ```
   DELETE FROM t WHERE id NOT IN (1, 10)
   ```
   
   The expected outcome in SQL is to remove the record with `id = 2` and keep 
the record with `null`, because `null NOT IN (1, 10)` evaluates to `NULL` rather 
than `true`. That's why it cannot be a metadata delete in Iceberg and we have to 
rewrite the file. However, that's not what happens now. We currently think that 
all records in the file match the condition and delete the entire file, meaning 
that we delete records we were not supposed to delete.
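   
   The SQL side of this can be reproduced with a small script. This is only a 
sketch: it uses SQLite through Python's standard `sqlite3` module as a stand-in 
for Spark (an assumption made for convenience; both apply three-valued logic to 
`NOT IN`), and the helper name `run_delete` is made up for the example.

```python
import sqlite3


def run_delete(rows, predicate_sql):
    """Load rows into an in-memory table t(id), run the DELETE, return what is left.

    Hypothetical helper for illustration; SQLite stands in for Spark SQL here.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (id INTEGER)")
        conn.executemany("INSERT INTO t (id) VALUES (?)", [(r,) for r in rows])
        # predicate_sql is trusted input in this sketch; do not do this with user input.
        conn.execute(f"DELETE FROM t WHERE {predicate_sql}")
        return [row[0] for row in conn.execute("SELECT id FROM t ORDER BY rowid")]
    finally:
        conn.close()


# The file from the example above: one record with id = 2 and one with null.
remaining = run_delete([2, None], "id NOT IN (1, 10)")
print(remaining)  # [None]: the null record survives, only id = 2 is removed
```

   Since one record has to survive, dropping the whole file cannot be correct 
for this predicate under SQL semantics.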
   
   If Iceberg expressions handle nulls differently by design, then we have to 
fix `SparkFilters`. Do I understand correctly that a filter from Spark like 
`col NOT IN (1, 2)` has to be translated into `notNull(col) && notIn(col, 1, 
2)` in Iceberg?
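   
   A rough way to sanity-check that translation is a plain-Python model. This 
is not Iceberg or Spark code: all function names below are hypothetical, and the 
"null matches `notIn`" behavior is assumed from what we observe in the DELETE 
above rather than taken from the evaluator implementation.

```python
from typing import Callable, Iterable, Optional, Set

Value = Optional[int]


def sql_not_in(value: Value, literals: Set[int]) -> Optional[bool]:
    """SQL three-valued NOT IN over non-null literals; None stands for UNKNOWN."""
    if value is None:
        return None
    return value not in literals


def null_matching_not_in(value: Value, literals: Set[int]) -> bool:
    """Assumed non-SQL semantics: null is simply 'not in the set', so it matches."""
    return value not in literals


def not_null_and_not_in(value: Value, literals: Set[int]) -> bool:
    """Model of the proposed translation: notNull(col) && notIn(col, ...)."""
    return value is not None and null_matching_not_in(value, literals)


def can_drop_whole_file(rows: Iterable[Value], predicate: Callable[[Value], object]) -> bool:
    """A metadata-only delete is safe only if every record in the file matches."""
    return all(predicate(v) is True for v in rows)


rows = [2, None]
literals = {1, 10}

# Without the null guard, every record appears to match, so the file would be dropped.
print(can_drop_whole_file(rows, lambda v: null_matching_not_in(v, literals)))  # True

# With the null guard, the null record no longer matches, forcing a file rewrite.
print(can_drop_whole_file(rows, lambda v: not_null_and_not_in(v, literals)))  # False

# The guarded predicate selects exactly the records SQL would delete.
print(all((sql_not_in(v, literals) is True) == not_null_and_not_in(v, literals) for v in rows))  # True
```

   In this model the guarded predicate agrees with SQL on which records to 
delete, which is what the translation in `SparkFilters` would need to preserve.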


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


