rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677025572
##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -62,6 +63,9 @@
ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE,
DataOperations.DELETE);
private static final Set<String>
VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS =
ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE);
+ // delete files are only added in "overwrite" operations
+ private static final Set<String> VALIDATE_REPLACED_DATA_FILES_OPERATIONS =
+ ImmutableSet.of(DataOperations.OVERWRITE);
Review comment:
I've considered this before and keep coming back to not splitting the
existing operations. Right now, the operations convey what's happening to the
table at a logical level. An append never removes rows, a delete never adds
rows, and a replace never changes rows. Then overwrite is used for anything
that adds and removes rows in the same commit. I think those definitions are
easy to understand and reason about.
If we added an operation for deltas, then we would either add a single
"delta" operation or "delta-delete", "delta-overwrite", and "delta-replace"
operations. If we use just "delta" then it wouldn't be clear what was happening
logically, so we'd lose the ability to ignore a row-level deletes in a logical
replace/rewrite. So I would conclude that we can either introduce 3 new
operations or no new operations.
So far, I don't think there's a strong justification for adding the 3 new
operations because the existing ones already cover what you're doing to the
data logically. We would only be adding more information about how that logical
operation is carried out with files. There may be some value there, but I'd
probably instead add a new property to signal that row-level deletes were used
so that we don't alter the contract of `DataOperation`.
Then the question becomes whether it is useful to know that row-level
deletes are present. If so, we can add a property for it. In this case, it
would save us some time reading the manifest list for an overwrite or delete.
Do you think we should add something to signal this?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]