[GitHub] [iceberg] rdblue commented on pull request #1318: Add equality field IDs to DeleteFile

GitBox Tue, 11 Aug 2020 13:13:46 -0700


rdblue commented on pull request #1318:
URL: https://github.com/apache/iceberg/pull/1318#issuecomment-672255352



   > Can these additional fields be nulls?
   
   Yes. In fact, the fields can be omitted from the delete file entirely. There 
is no requirement to have any fields other than the ones in the delete ID set. 
We may also want to build a way to pass a row to get stats for the delete file, 
but not actually store the deleted values. But I think optimizations like this 
can be done in follow-up PRs.
   
   > Why equality field IDs in DeleteFile?
   
   When you encode a delete, the equality columns are fixed. As Anton noted, 
the columns you use may change as the table schema evolves. In those cases, we 
need to make sure that each delete uses equality columns from when it was 
committed. Storing them in `DeleteFile` metadata is a good way to do that.
   
   In addition to being able to change the column set for schema evolution, 
this supports deleting by different columns if needed. For example, 
GDPR deletes may come in by email address or by user name, and we want to 
support encoding both without scanning to find the file/position to delete.
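   
   As a rough illustration of the idea, here is a minimal sketch in which each delete file carries the equality field IDs it was committed with, so a delete by email and a delete by user name can coexist. The class `SketchDeleteFile`, the helpers, and the field IDs are hypothetical and are not Iceberg's API; rows are modeled as maps keyed by field ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class EqualityDeleteSketch {
  // Hypothetical field IDs for an example users table.
  static final int USER_NAME = 1;
  static final int EMAIL = 2;

  /** Hypothetical stand-in for a delete file plus the field IDs kept in its metadata. */
  static final class SketchDeleteFile {
    final List<Integer> equalityFieldIds;         // fixed when the delete was committed
    final List<Map<Integer, Object>> deleteRows;  // values keyed by field ID

    SketchDeleteFile(List<Integer> equalityFieldIds, List<Map<Integer, Object>> deleteRows) {
      this.equalityFieldIds = equalityFieldIds;
      this.deleteRows = deleteRows;
    }

    // A data row is deleted when it equals some delete row on every equality field.
    // This sketch treats a missing or null value as equal to another null.
    boolean deletes(Map<Integer, Object> dataRow) {
      for (Map<Integer, Object> deleteRow : deleteRows) {
        boolean matches = true;
        for (int id : equalityFieldIds) {
          if (!Objects.equals(deleteRow.get(id), dataRow.get(id))) {
            matches = false;
            break;
          }
        }
        if (matches) {
          return true;
        }
      }
      return false;
    }
  }

  static Map<Integer, Object> row(Object userName, Object email) {
    Map<Integer, Object> r = new HashMap<>();
    r.put(USER_NAME, userName);
    r.put(EMAIL, email);
    return r;
  }

  // Builds a delete file that deletes by a single column and stores only that column.
  static SketchDeleteFile deleteBy(int fieldId, Object value) {
    Map<Integer, Object> deleteRow = new HashMap<>();
    deleteRow.put(fieldId, value);
    return new SketchDeleteFile(
        Collections.singletonList(fieldId), Collections.singletonList(deleteRow));
  }

  // Keeps the rows that no delete file matches; no scan for file/position is needed.
  static List<Map<Integer, Object>> applyDeletes(
      List<Map<Integer, Object>> rows, List<SketchDeleteFile> deleteFiles) {
    List<Map<Integer, Object>> kept = new ArrayList<>();
    for (Map<Integer, Object> r : rows) {
      boolean deleted = false;
      for (SketchDeleteFile deleteFile : deleteFiles) {
        if (deleteFile.deletes(r)) {
          deleted = true;
          break;
        }
      }
      if (!deleted) {
        kept.add(r);
      }
    }
    return kept;
  }

  // One delete arrives by email, another by user name; returns the surviving user names.
  static List<Object> demo() {
    List<Map<Integer, Object>> rows = Arrays.asList(
        row("alice", "alice@example.com"),
        row("bob", "bob@example.com"),
        row("carol", "carol@example.com"));
    List<SketchDeleteFile> deleteFiles = Arrays.asList(
        deleteBy(EMAIL, "alice@example.com"), deleteBy(USER_NAME, "bob"));
    List<Object> names = new ArrayList<>();
    for (Map<Integer, Object> r : applyDeletes(rows, deleteFiles)) {
      names.add(r.get(USER_NAME));
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

   Because matching goes through field IDs recorded with each delete file, a later change to the table's column set does not alter which columns an already committed delete compares on.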
   
   Lastly, we could store these just in file metadata. But keeping them at the 
table level enables a few things. First, we can group files that use the same 
delete fields into the same filter set. Second, we can use them at job planning 
time to check whether a data file and a delete file overlap using delete column 
stats. But we can only do that if we know which columns are used to delete at 
job planning time.
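
   The two planning-time uses above can be sketched as follows. This is only an illustration under stated assumptions: `FileMeta`, `groupByEqualityFields`, and `mayOverlap` are hypothetical names, not Iceberg's API, and column stats are simplified to `long` lower/upper bounds keyed by field ID.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DeletePlanningSketch {
  /** Hypothetical per-file metadata that is available at job planning time. */
  static final class FileMeta {
    final String path;
    final Set<Integer> equalityFieldIds;   // empty for data files
    final Map<Integer, Long> lowerBounds;  // column stats keyed by field ID
    final Map<Integer, Long> upperBounds;

    FileMeta(String path, Set<Integer> equalityFieldIds,
             Map<Integer, Long> lowerBounds, Map<Integer, Long> upperBounds) {
      this.path = path;
      this.equalityFieldIds = equalityFieldIds;
      this.lowerBounds = lowerBounds;
      this.upperBounds = upperBounds;
    }
  }

  // Convenience factory: a file that has stats for a single field.
  static FileMeta file(String path, Set<Integer> equalityFieldIds,
                       int statsFieldId, long lower, long upper) {
    return new FileMeta(path, equalityFieldIds,
        Collections.singletonMap(statsFieldId, lower),
        Collections.singletonMap(statsFieldId, upper));
  }

  // Delete files with the same equality field set can share one filter set.
  static Map<Set<Integer>, List<FileMeta>> groupByEqualityFields(List<FileMeta> deleteFiles) {
    Map<Set<Integer>, List<FileMeta>> groups = new LinkedHashMap<>();
    for (FileMeta deleteFile : deleteFiles) {
      groups.computeIfAbsent(deleteFile.equalityFieldIds, k -> new ArrayList<>())
          .add(deleteFile);
    }
    return groups;
  }

  // A delete row matches only if every equality field is equal, so one disjoint
  // value range on any delete column rules the pair out. Missing stats rule nothing out.
  static boolean mayOverlap(FileMeta dataFile, FileMeta deleteFile) {
    for (int id : deleteFile.equalityFieldIds) {
      Long dataLower = dataFile.lowerBounds.get(id);
      Long dataUpper = dataFile.upperBounds.get(id);
      Long deleteLower = deleteFile.lowerBounds.get(id);
      Long deleteUpper = deleteFile.upperBounds.get(id);
      if (dataLower == null || dataUpper == null || deleteLower == null || deleteUpper == null) {
        continue;
      }
      if (deleteUpper < dataLower || deleteLower > dataUpper) {
        return false;
      }
    }
    return true;
  }
}
```

   Neither step is possible unless the planner knows the equality field IDs of each delete file without opening it, which is the reason for keeping them in table-level metadata.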


