rdblue commented on a change in pull request #2892:
URL: https://github.com/apache/iceberg/pull/2892#discussion_r680235789
##########
File path: core/src/test/java/org/apache/iceberg/TestRewriteFiles.java
##########
@@ -641,4 +641,115 @@ public void testNewDeleteFile() {
.rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_A2))
.apply();
}
+
+ @Test
+ public void testRewriteReferencedDataFile() {
+ Assume.assumeTrue("Delete files are only supported in v2", formatVersion > 1);
+
+ table.newAppend()
+ .appendFile(FILE_A)
+ .commit();
+
+ table.newRowDelta()
+ .addDeletes(FILE_A_DELETES)
+ .commit();
+
+ long snapshotBeforeDeleteRewrite = table.currentSnapshot().snapshotId();
+
+ // simulate rewrite deletes in FILE_A_DELETES to FILE_B_DELETES
+ table.newRewrite()
+ .validateFromSnapshot(snapshotBeforeDeleteRewrite)
+ .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+ .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_A_DELETES), Sets.newSet(), Sets.newSet(FILE_B_DELETES))
+ .commit();
+
+ long snapshotBeforeRewriteFileA = table.currentSnapshot().snapshotId();
+
+ // rewrite FILE_A as FILE_A2
+ table.newRewrite()
+ .validateFromSnapshot(table.currentSnapshot().snapshotId())
+ .rewriteFiles(Sets.newSet(FILE_A), Sets.newSet(FILE_A2))
+ .commit();
+
+ AssertHelpers.assertThrows("Should fail because a referenced file was rewritten",
+ ValidationException.class, "Cannot commit, missing data files",
+ () -> table.newRewrite()
+ .validateFromSnapshot(snapshotBeforeRewriteFileA)
+ .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+ .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_B_DELETES), Sets.newSet(), Sets.newSet(FILE_A_DELETES))
+ .apply());
+ }
+
+ @Test
+ public void testOverwriteReferencedDataFile() {
+ Assume.assumeTrue("Delete files are only supported in v2", formatVersion > 1);
+
+ table.newAppend()
+ .appendFile(FILE_A)
+ .commit();
+
+ table.newRowDelta()
+ .addDeletes(FILE_A_DELETES)
+ .commit();
+
+ long snapshotBeforeDeleteRewrite = table.currentSnapshot().snapshotId();
+
+ // simulate rewrite deletes in FILE_A_DELETES to FILE_B_DELETES
+ table.newRewrite()
+ .validateFromSnapshot(snapshotBeforeDeleteRewrite)
+ .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+ .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_A_DELETES), Sets.newSet(), Sets.newSet(FILE_B_DELETES))
+ .commit();
+
+ long snapshotBeforeOverwriteFileA = table.currentSnapshot().snapshotId();
+
+ // overwrite FILE_A with FILE_A2
+ table.newOverwrite()
+ .deleteFile(FILE_A)
+ .addFile(FILE_A2)
+ .commit();
+
+ // the rewrite succeeds because the overwrite is required to read FILE_A correctly
+ table.newRewrite()
+ .validateFromSnapshot(snapshotBeforeOverwriteFileA)
+ .validateDataFilesNotRewritten(Sets.newSet(FILE_A.path()))
+ .rewriteFiles(Sets.newSet(), Sets.newSet(FILE_B_DELETES), Sets.newSet(), Sets.newSet(FILE_A_DELETES))
Review comment:
This is the case I was thinking about in my last comment on the other
PR. For `RewriteAction_1` to be valid, it must reuse the sequence number from
`FILE_A` for `FILE_B`. Otherwise, the rewrite on its own would un-delete
a row because `EQ_DELETE_FILE_A` would no longer apply.
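To make the sequence-number point concrete, here is a minimal sketch. It assumes the v2 spec rule that an equality delete applies only to data files whose data sequence number is strictly lower than the delete's. The class name, method name, and the sequence numbers 1, 2, 3 are hypothetical and are not part of this PR.

```java
public class SequenceNumberSketch {
  // Assumed rule (Iceberg v2 spec): an equality delete applies to a data file
  // only when the data file's sequence number is strictly lower than the delete's.
  public static boolean equalityDeleteApplies(long dataSeq, long deleteSeq) {
    return dataSeq < deleteSeq;
  }

  public static void main(String[] args) {
    long fileASeq = 1;    // hypothetical sequence number of FILE_A
    long eqDeleteSeq = 2; // hypothetical sequence number of EQ_DELETE_FILE_A
    long rewriteSeq = 3;  // hypothetical sequence number assigned to the rewrite commit

    // FILE_B reuses FILE_A's sequence number: the equality delete still applies.
    System.out.println(equalityDeleteApplies(fileASeq, eqDeleteSeq)); // true

    // FILE_B takes the rewrite's sequence number: the delete no longer applies,
    // so the deleted row would reappear.
    System.out.println(equalityDeleteApplies(rewriteSeq, eqDeleteSeq)); // false
  }
}
```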
The validation in this PR can catch this case because the files referenced
by `POS_DELETE_FILE_C` would be passed to the validation. That's `FILE_A` in
this case, so the commit for `RewriteAction_2` would check that `FILE_A` still
exists and would fail. I'm fine merging this PR if you think that this is
something that may happen.
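The check described above amounts to a set difference between the referenced data file paths and the paths that are still live at commit time. The sketch below shows only that idea and is not the implementation in this PR: `ReferencedFilesCheck`, its methods, and the file paths are hypothetical, and `IllegalStateException` stands in for `ValidationException`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ReferencedFilesCheck {
  // Returns the referenced paths that are no longer present in the live set.
  public static Set<String> missingReferencedFiles(Set<String> referenced, Set<String> live) {
    Set<String> missing = new HashSet<>(referenced);
    missing.removeAll(live);
    return missing;
  }

  // Fails the (simulated) commit if any referenced data file has been removed.
  // IllegalStateException is a stand-in for Iceberg's ValidationException.
  public static void validateDataFilesExist(Set<String> referenced, Set<String> live) {
    Set<String> missing = missingReferencedFiles(referenced, live);
    if (!missing.isEmpty()) {
      throw new IllegalStateException("Cannot commit, missing data files: " + missing);
    }
  }

  public static void main(String[] args) {
    // Hypothetical paths: FILE_A was rewritten to FILE_A2 by a concurrent commit.
    Set<String> live = new HashSet<>(Arrays.asList("/path/to/data-a2.parquet"));
    Set<String> referenced = new HashSet<>(Arrays.asList("/path/to/data-a.parquet"));
    try {
      validateDataFilesExist(referenced, live);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```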
But I think it is really unlikely that rewrites will alter sequence
numbers to avoid applying deletes. That makes little sense because you may as
well apply deletes while you're already rewriting the data. And even if you
wanted to, it may not be possible when the files being rewritten are
from different sequence numbers, with different sets of equality delete files
that must be applied. I think a far better option is to apply the deletes when
rewriting.
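As a rough illustration of applying deletes while rewriting, the sketch below drops rows matched by position or equality deletes as the rewritten file is written, so the output no longer depends on any delete file. It is a toy model under stated assumptions: rows are plain `long` ids, equality deletes match on that id, and `ApplyDeletesOnRewrite` and its method are hypothetical names, not Iceberg APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ApplyDeletesOnRewrite {
  // Copies rows into the rewritten file, skipping any row removed by a
  // position delete (matched by row position) or an equality delete (matched by id).
  public static List<Long> rewrite(List<Long> rowIds, Set<Integer> deletedPositions, Set<Long> deletedIds) {
    List<Long> rewritten = new ArrayList<>();
    for (int pos = 0; pos < rowIds.size(); pos++) {
      Long id = rowIds.get(pos);
      if (deletedPositions.contains(pos) || deletedIds.contains(id)) {
        continue; // row is deleted, so it is not carried into the new file
      }
      rewritten.add(id);
    }
    return rewritten;
  }

  public static void main(String[] args) {
    List<Long> rows = Arrays.asList(10L, 11L, 12L, 13L);
    Set<Integer> positionDeletes = new HashSet<>(Arrays.asList(0)); // deletes the row at position 0
    Set<Long> equalityDeletes = new HashSet<>(Arrays.asList(12L));  // deletes rows with id 12
    System.out.println(rewrite(rows, positionDeletes, equalityDeletes)); // [11, 13]
  }
}
```

Because the deleted rows are gone from the rewritten file, its sequence number does not need to be adjusted to keep older delete files applicable.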
I'm fine merging this if you think we need it. I'll remove the draft status
so that we can.