JingsongLi commented on pull request #1318: URL: https://github.com/apache/iceberg/pull/1318#issuecomment-671826837
Hi @rdblue, thanks for your work. These two PRs look very good. I have two comments:

## Optimization for upsert data

An upsert writes an insert file and a delete file at the same time, which can double the storage. I'm thinking about some scenarios:

- The downstream does not need to restore the CDC data stream.
- Downstream engines only need the PKs (equality field IDs) for delete records.

How can we reduce storage in these scenarios? Can the additional (non-key) fields be null?

## Why equality field IDs in `DeleteFile`?

Why not just use a primary key definition on the table? Will the equality field IDs differ between files? Is this meant to support schema evolution?
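To make the storage question concrete, here is a minimal sketch of how an equality delete could be applied when only the key columns are compared. It is plain Python, not the Iceberg API: the names `project_key` and `apply_equality_deletes`, the sample columns, and the idea of passing a per-file `equality_fields` list are all assumptions for illustration, and sequence numbers are ignored. Under these assumptions, a delete record with nulls in the non-key fields removes the same rows as a delete record that carries the full row.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def project_key(row: Row, equality_fields: List[str]) -> Tuple[Any, ...]:
    """Keep only the equality (key) columns of a row, in a fixed order."""
    return tuple(row.get(field) for field in equality_fields)


def apply_equality_deletes(
    data_rows: List[Row],
    delete_rows: List[Row],
    equality_fields: List[str],
) -> List[Row]:
    """Drop every data row whose key matches the key of some delete row.

    Hypothetical helper: equality_fields stands in for the equality field
    IDs attached to one delete file. Non-key columns are never compared.
    """
    deleted_keys = {project_key(d, equality_fields) for d in delete_rows}
    return [
        row
        for row in data_rows
        if project_key(row, equality_fields) not in deleted_keys
    ]


# Made-up sample data; "id" plays the role of the primary key.
data_rows = [
    {"id": 1, "name": "a", "city": "x"},
    {"id": 2, "name": "b", "city": "y"},
    {"id": 3, "name": "c", "city": "z"},
]

# Delete record that repeats the full row (what a CDC replay would need).
full_deletes = [{"id": 2, "name": "b", "city": "y"}]

# Delete record that keeps only the key and nulls out everything else.
key_only_deletes = [{"id": 2, "name": None, "city": None}]

remaining_full = apply_equality_deletes(data_rows, full_deletes, ["id"])
remaining_key_only = apply_equality_deletes(data_rows, key_only_deletes, ["id"])
```

Because the comparison only looks at the columns named in `equality_fields`, both delete records leave the same rows behind. The full-row form only matters when a downstream consumer needs the old values back, which is the CDC case mentioned above.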
