rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r679563193
##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -85,4 +90,18 @@ public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile
return this;
}
+
+  @Override
+  public RewriteFiles validateFromSnapshot(long snapshotId) {
+    this.startingSnapshotId = snapshotId;
+    return this;
+  }
+
+  @Override
+  protected void validate(TableMetadata base) {
+    if (replacedDataFiles.size() > 0) {
+      // if there are replaced data files, there cannot be any new row-level deletes for those data files
+      validateNoNewDeletesForDataFiles(base, startingSnapshotId, replacedDataFiles);
Review comment:
@openinx, I opened a draft PR, https://github.com/apache/iceberg/pull/2892, to fix this, but I'm not sure that we actually want to merge it.
Originally, I copied the `validateDataFilesExist` implementation from `RowDelta` and added tests. That works well and is a fairly simple commit if we want it. But as I was testing it, I added cases for delete, overwrite, rewrite, and a transactional delete-and-append that amounts to an overwrite.
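Roughly, the check I copied looks like the sketch below. Treat it as illustrative only: the field and helper names (`replacedDeleteFiles`, `referencedDataFiles`, and the exact `validateDataFilesExist` signature) are placeholders modeled on the `RowDelta`/`MergingSnapshotProducer` code path, not necessarily what the draft PR uses.

```java
// Sketch only: names below are assumptions, not copied from #2892.
@Override
protected void validate(TableMetadata base) {
  if (replacedDataFiles.size() > 0) {
    // replaced data files must not have gained new row-level deletes concurrently
    validateNoNewDeletesForDataFiles(base, startingSnapshotId, replacedDataFiles);
  }

  if (replacedDeleteFiles.size() > 0) {
    // hypothetical: collect the data file paths referenced by the position delete
    // files being rewritten, then verify none of them were removed since the
    // starting snapshot (the validateDataFilesExist-style check from RowDelta)
    CharSequenceSet referenced = referencedDataFiles(replacedDeleteFiles);
    validateDataFilesExist(base, startingSnapshotId, referenced, false /* skipDeletes */);
  }
}
```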
Then I realized that not all of those cases apply. For example, if I delete a file that has row-level deletes, that's perfectly okay: the operation deletes all rows that weren't already deleted, and any dangling positional deletes will simply never be applied. Similarly, an overwrite from a MERGE INTO operation, for example, that replaces a file needs to read that file first and then build its replacement. That read must merge the existing deletes, so the deleted rows are already removed from the replacement file. In that case the deletes are also fine to leave dangling, even if they are rewritten into a new file.
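To make the delete case concrete, here is a rough sketch of the two concurrent commits in question, using the public API. The four-argument `rewriteFiles` overload and `validateFromSnapshot` are the ones this PR adds; the `dataFileA`/`posDelete*`/`startingSnapshotId` variables are placeholders. Even if the delete of `dataFileA` commits first, the compacted position delete file just dangles and is never applied, so there is nothing for the rewrite to conflict with.

```java
// Writer 1: rewrite (e.g. compact) the position delete files pointing at dataFileA.
table.newRewrite()
    .rewriteFiles(
        ImmutableSet.of(),                            // no data files removed
        ImmutableSet.of(posDeleteA1, posDeleteA2),    // delete files being replaced
        ImmutableSet.of(),                            // no data files added
        ImmutableSet.of(posDeleteACompacted))         // compacted delete file
    .validateFromSnapshot(startingSnapshotId)
    .commit();

// Writer 2, concurrently: drop dataFileA entirely, e.g. a DELETE that removes all
// of its remaining live rows. The rewritten delete file dangles but is harmless.
table.newDelete()
    .deleteFile(dataFileA)
    .commit();
```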
I ended up leaving a validation that the data files are not rewritten, but I also think that may not be necessary, because I don't think it would be reasonable to rewrite a data file without also applying the deletes. If we build a compaction operation, we will probably apply deletes during compaction. We could also build one that compacts just data files, but for that operation to be correct it would also need to rewrite the delete files that point to the compacted data files, or to use an older sequence number with equality deletes. If it rewrote delete files, then the delete rewrite would fail. If it changed sequence numbers, then we would have a lot more validation to do when building that plan.
After thinking about it, I don't think that we actually need to handle any
cases where a compaction and delete file rewrite are concurrent, at least not
from the delete file rewrite side. Any compaction should handle deletes as part
of the compaction.
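For completeness, this is the kind of compaction commit I mean, again as a hedged sketch with placeholder variables: the compacted data file is produced by reading the old data files with their deletes merged in, so the commit removes the old data files and their delete files together and leaves no dangling state behind.

```java
// Compaction that applies deletes: read oldDataFile1/2 with their position and
// equality deletes applied, write compactedDataFile, then commit the rewrite of
// both the data files and the delete files that pointed at them.
table.newRewrite()
    .rewriteFiles(
        ImmutableSet.of(oldDataFile1, oldDataFile2),  // data files being compacted
        ImmutableSet.of(posDelete1, eqDelete1),       // their delete files
        ImmutableSet.of(compactedDataFile),           // deletes already applied here
        ImmutableSet.of())                            // no new delete files needed
    .validateFromSnapshot(startingSnapshotId)
    .commit();
```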