rdblue commented on a change in pull request #4272:
URL: https://github.com/apache/iceberg/pull/4272#discussion_r820295897
##########
File path: docs/common/format/spec.md
##########
@@ -444,6 +444,8 @@ Notes:
1. Technically, data files can be deleted when the last snapshot that contains
the file as “live” data is garbage collected. But this is harder to detect and
requires finding the diff of multiple snapshots. It is easier to track what
files are deleted in a snapshot and delete them when that snapshot expires.
2. Manifest list files are required in v2, so that the `sequence_number` and
`snapshot_id` to inherit are always available.
+3. When a data file or delete file is marked as deleted in a manifest, the
writer of the new snapshot may not include a preexisting manifest that
references the data file as EXISTING or ADDED. This implies that the manifest
from the prior snapshot that included the newly deleted data file must not be
in the new snapshot and that any files referenced by the manifest that are
still part of the table must be written to a new
Review comment:
I think we can make a stronger and clearer statement about this.

Delete entries are mainly informational. They are ignored when planning scans
and exist primarily to track files for incremental delete: when you remove a
snapshot, you can scan the manifests that were written for that snapshot to
find what was deleted in it. There are some issues with this approach, so we
tend to recommend comparing reachable file sets to actually drop files. For
example, we are reluctant to state in the spec that adding a file back to the
table after it has been deleted is not allowed. But that could cause
incremental deletes to break valid snapshots.

What I would state in the spec is that any given snapshot should not contain
more than one ADDED or EXISTING entry for a file. That's what you're getting at
in the update below, I think. We can also note that adding a file back to the
table after it has been deleted is not recommended.
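
To make the two ideas in this comment concrete, here is a rough sketch in
Python. It is not Iceberg's implementation or API: `Entry`,
`duplicate_live_files`, and `files_safe_to_drop` are hypothetical names, and a
manifest entry is reduced to a status and a file path. The first function
checks that a snapshot has at most one ADDED or EXISTING entry per file; the
second drops files by comparing reachable file sets rather than trusting
DELETED entries.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set


class Entry(NamedTuple):
    """Hypothetical, simplified manifest entry (not the Iceberg schema)."""
    status: str      # "EXISTING", "ADDED", or "DELETED"
    file_path: str


# Statuses that make a file "live" in a snapshot.
LIVE = {"ADDED", "EXISTING"}


def duplicate_live_files(entries: Iterable[Entry]) -> Set[str]:
    """Return paths with more than one ADDED/EXISTING entry in one snapshot.

    `entries` is assumed to be every entry from every manifest that the
    snapshot's manifest list references. DELETED entries are ignored.
    """
    counts = Counter(e.file_path for e in entries if e.status in LIVE)
    return {path for path, n in counts.items() if n > 1}


def files_safe_to_drop(expired: Iterable[Entry],
                       retained: Iterable[Entry]) -> Set[str]:
    """Compare reachable file sets instead of reading DELETED entries.

    A file is dropped only if it was live in an expired snapshot and is not
    live in any retained snapshot, so a file that was deleted and later
    re-added stays on disk.
    """
    still_reachable = {e.file_path for e in retained if e.status in LIVE}
    candidates = {e.file_path for e in expired if e.status in LIVE}
    return candidates - still_reachable
```

With this shape, a snapshot that carries both an old manifest listing a file as
EXISTING and a new manifest listing it as ADDED would be flagged by
`duplicate_live_files`, while a file that is live in any retained snapshot is
never returned by `files_safe_to_drop`.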
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]