rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r677025572



##########
File path: core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java
##########
@@ -62,6 +63,9 @@
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE, 
DataOperations.DELETE);
   private static final Set<String> 
VALIDATE_DATA_FILES_EXIST_SKIP_DELETE_OPERATIONS =
       ImmutableSet.of(DataOperations.OVERWRITE, DataOperations.REPLACE);
+  // delete files are only added in "overwrite" operations
+  private static final Set<String> VALIDATE_REPLACED_DATA_FILES_OPERATIONS =
+      ImmutableSet.of(DataOperations.OVERWRITE);

Review comment:
       I've considered this before and keep coming back to not splitting the 
existing operations. Right now, the operations convey what's happening to the 
table at a logical level. An append never removes rows, a delete never adds 
rows, and a replace never changes rows. Then overwrite is used for anything 
that adds and removes rows in the same commit. I think those definitions are 
easy to understand and reason about.
   
   If we added an operation for deltas, then we would either add a single 
"delta" operation or "delta-delete", "delta-overwrite", and "delta-replace" 
operations. If we use just "delta" then it wouldn't be clear what was happening 
logically, so we'd lose the ability to ignore a row-level deletes in a logical 
replace/rewrite. So I would conclude that we can either introduce 3 new 
operations or no new operations.
   
   So far, I don't think there's a strong justification for adding the 3 new 
operations because the existing ones already cover what you're doing to the 
data logically. We would only be adding more information about how that logical 
operation is carried out with files. There may be some value there, but I'd 
probably instead add a new property to signal that row-level deletes were used 
so that we don't alter the contract of `DataOperation`.
   
   Then the question becomes whether it is useful to know that row-level 
deletes are present. If so, we can add a property for it. In this case, it 
would save us some time reading the manifest list for an overwrite or delete. 
Do you think we should add something to signal this?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to