rdblue commented on a change in pull request #3069:
URL: https://github.com/apache/iceberg/pull/3069#discussion_r711803754
##########
File path: core/src/main/java/org/apache/iceberg/BaseOverwriteFiles.java
##########
@@ -127,8 +131,17 @@ protected void validate(TableMetadata base) {
}
}
+ // validating concurrently added data files is optional and is only needed
to achieve serializable isolation
if (conflictDetectionFilter != null && base.currentSnapshot() != null) {
validateAddedDataFiles(base, startingSnapshotId,
conflictDetectionFilter, caseSensitive);
}
+
+ // validating concurrently added delete files is required whenever we
overwrite specific data files
+ // if we find a new delete file matching one of the data files we are
trying to overwrite,
+ // we must fail this operation as it would undelete rows that were removed
concurrently
+ if (deletedDataFiles.size() > 0) {
+ validateNoNewDeletesForDataFiles(
+ base, startingSnapshotId, conflictDetectionFilter, deletedDataFiles,
caseSensitive);
Review comment:
I'm not sure that I agree that we only need to check the files that were
directly deleted.
For example, you can use Spark to replace an expression using
`DataFrameWriterV2`:
```scala
df.writeTo(t).overwrite($"date" === "2021-10-01")
```
Whether the deleted files should be checked for row-level deletes depends on
whether the written `df` is a result of reading the table. If you're performing
your own merge, then it should be. I realize that there's not currently a way
to enable the validation, but we could support write properties for this
eventually:
```scala
df.writeTo(t)
.option("validate-snapshot-id", 1234L) // use serializable isolation
.overwrite($"date" === "2021-10-01")
```
Is this something that we don't need to support right now because no one is
calling it? I think we should still consider adding the validation for later.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]