rdblue commented on a change in pull request #4272:
URL: https://github.com/apache/iceberg/pull/4272#discussion_r820295897



##########
File path: docs/common/format/spec.md
##########
@@ -444,6 +444,8 @@ Notes:
 
 1. Technically, data files can be deleted when the last snapshot that contains 
the file as “live” data is garbage collected. But this is harder to detect and 
requires finding the diff of multiple snapshots. It is easier to track what 
files are deleted in a snapshot and delete them when that snapshot expires.
 2. Manifest list files are required in v2, so that the `sequence_number` and 
`snapshot_id` to inherit are always available.
+3. When a data file or delete file is marked as deleted in a manifest the 
writer of the new snapshot may not include a prexisting manifest that 
references the data file as EXISTING or ADDED. This implies that the manifest 
from the prior snapshot that including the newly deleted data file must not be 
in the new snapshot and that any files referenced by the manifest that are 
still part of the table must be written to a new

Review comment:
       I think we can have a stronger and clearer statement about this.
   
   Delete entries are mainly informational. Those entries are ignored when 
planning scans and exist primarily to track files for incremental deletes: 
when you remove a snapshot, you can scan the manifests that were written for 
that snapshot to find what was deleted in it. There are some issues with this 
approach, so we tend to recommend comparing reachable file sets to decide which 
files to actually drop. For example, we are reluctant to state in the spec that 
adding a file back to the table after it has been deleted is not allowed. But 
if that does happen, incremental deletes could break valid snapshots.
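   
   To make the difference between the two cleanup approaches concrete, here is 
a rough Python sketch. It is not Iceberg code: the `(status, file_path)` tuple 
model and the names `live_files`, `incremental_drop`, and `reachable_drop` are 
made up for illustration, and real manifests carry much more metadata.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical model: a snapshot is just a list of (status, file_path) manifest entries.
Entry = Tuple[str, str]
Snapshot = List[Entry]


def live_files(snapshot: Snapshot) -> Set[str]:
    """Files a scan of this snapshot would read; DELETED entries are ignored."""
    return {path for status, path in snapshot if status in ("ADDED", "EXISTING")}


def incremental_drop(expired: Snapshot) -> Set[str]:
    """Incremental approach: trust the DELETED entries written for the expired snapshot."""
    return {path for status, path in expired if status == "DELETED"}


def reachable_drop(expired: Iterable[Snapshot], retained: Iterable[Snapshot]) -> Set[str]:
    """Reachable-set approach: drop only files that no retained snapshot still reads."""
    expired_live: Set[str] = set()
    for snapshot in expired:
        expired_live |= live_files(snapshot)
    retained_live: Set[str] = set()
    for snapshot in retained:
        retained_live |= live_files(snapshot)
    return expired_live - retained_live


# s2 deletes a.parquet, then s3 adds the same file back.
s1 = [("ADDED", "a.parquet"), ("ADDED", "b.parquet")]
s2 = [("DELETED", "a.parquet"), ("EXISTING", "b.parquet")]
s3 = [("ADDED", "a.parquet"), ("EXISTING", "b.parquet")]
```

   When s1 and s2 are expired and s3 is kept, `incremental_drop(s2)` returns 
`a.parquet`, which s3 still reads, while `reachable_drop([s1, s2], [s3])` 
returns an empty set.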
   
   What I would state in the spec is that any given snapshot should not contain 
more than one ADDED or EXISTING entry for a file. That's what you're getting at 
in the update below, I think. We can also note that adding a file back to the 
table after it has been deleted is not recommended.
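
   As an illustration of that rule, a check could look roughly like the sketch 
below. Again, this is only an assumption about how one might model it: the 
`(status, file_path)` tuples and the `duplicated_live_files` helper are 
hypothetical, not part of the Iceberg API.

```python
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical model: every manifest entry reachable from one snapshot,
# flattened into (status, file_path) tuples.
Entry = Tuple[str, str]


def duplicated_live_files(snapshot_entries: List[Entry]) -> Set[str]:
    """Return files that have more than one ADDED or EXISTING entry in a snapshot."""
    counts = Counter(
        path for status, path in snapshot_entries if status in ("ADDED", "EXISTING")
    )
    return {path for path, count in counts.items() if count > 1}


# a.parquet is EXISTING in an old manifest and ADDED again in a new one.
bad_snapshot = [
    ("EXISTING", "a.parquet"),
    ("ADDED", "a.parquet"),
    ("DELETED", "b.parquet"),
    ("EXISTING", "c.parquet"),
]
violations = duplicated_live_files(bad_snapshot)
```

   A snapshot satisfies the rule when the returned set is empty. DELETED 
entries are not counted, since they are ignored when planning scans.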




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


