rdblue commented on pull request #1318: URL: https://github.com/apache/iceberg/pull/1318#issuecomment-672255352
> Can these additional fields be nulls?

Yes. In fact, the fields can be omitted from the delete file entirely. There is no requirement to have any fields other than the ones in the delete ID set. We may also want to build a way to pass a row to get stats for the delete file, but not actually store the deleted values. But I think optimizations like this can be done in follow-up PRs.

> Why equality field IDs in DeleteFile?

When you encode a delete, the equality columns are fixed. As Anton noted, the columns you use may change as the table schema evolves. In those cases, we need to make sure that each delete uses the equality columns from when it was committed. Storing them in `DeleteFile` metadata is a good way to do that.

In addition to allowing the column set to change with schema evolution, this supports deleting by different columns if needed. For example, GDPR deletes may come in by email address or by user name, and we want to support encoding both without scanning to find the file/position to delete.

Lastly, we could store these just in file metadata. But keeping them at the table level enables a few things. First, we can group files that use the same delete fields into the same filter set. Second, we can use them at job planning time to check whether a data file and a delete file overlap using delete column stats. But we can only do that if we know at job planning time which columns are used to delete.
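To make the last two points concrete, here is a minimal sketch of grouping delete files by their equality field IDs and pruning by column bounds at planning time. It is not the Iceberg API: `FileStats`, `EqualityDelete`, `group_by_equality_ids`, `may_overlap`, the file paths, and the field IDs (1 for email, 2 for user name) are all hypothetical names chosen for illustration, and null handling is ignored.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, List


@dataclass
class FileStats:
    """Hypothetical per-file metadata: column bounds keyed by field ID."""
    path: str
    lower: Dict[int, Any]
    upper: Dict[int, Any]


@dataclass
class EqualityDelete:
    """Hypothetical delete file entry carrying its own equality field IDs."""
    stats: FileStats
    equality_ids: FrozenSet[int]


def group_by_equality_ids(
    deletes: List[EqualityDelete],
) -> Dict[FrozenSet[int], List[EqualityDelete]]:
    """Group delete files that use the same delete columns into one filter set."""
    groups = defaultdict(list)
    for delete in deletes:
        groups[delete.equality_ids].append(delete)
    return dict(groups)


def may_overlap(data: FileStats, delete: EqualityDelete) -> bool:
    """Return False only when bounds prove no delete row can match the data file."""
    for field_id in delete.equality_ids:
        bounds = (data.lower, data.upper, delete.stats.lower, delete.stats.upper)
        if any(field_id not in b for b in bounds):
            continue  # missing stats: this column cannot rule anything out
        disjoint = (
            delete.stats.upper[field_id] < data.lower[field_id]
            or delete.stats.lower[field_id] > data.upper[field_id]
        )
        if disjoint:
            return False  # one disjoint equality column is enough to skip
    return True


# Assumed example: field ID 1 is an email column, field ID 2 is a user name column.
data_file = FileStats(
    "data-1.parquet",
    lower={1: "a@example.com", 2: "alice"},
    upper={1: "m@example.com", 2: "mike"},
)
by_email = EqualityDelete(
    FileStats("del-email.parquet", {1: "c@example.com"}, {1: "d@example.com"}),
    frozenset({1}),
)
by_email_far = EqualityDelete(
    FileStats("del-email-2.parquet", {1: "x@example.com"}, {1: "z@example.com"}),
    frozenset({1}),
)
by_name = EqualityDelete(
    FileStats("del-name.parquet", {2: "nancy"}, {2: "zoe"}),
    frozenset({2}),
)

groups = group_by_equality_ids([by_email, by_email_far, by_name])
to_apply = [
    d.stats.path
    for d in (by_email, by_email_far, by_name)
    if may_overlap(data_file, d)
]
```

In this sketch the two email-based deletes land in one group and the name-based delete in another, and only `del-email.parquet` has bounds that intersect the data file. None of this works unless the equality field IDs are available in metadata at planning time, which is the argument for storing them in `DeleteFile`.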
