rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r679563193



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -85,4 +90,18 @@ public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile
 
     return this;
   }
+
+  @Override
+  public RewriteFiles validateFromSnapshot(long snapshotId) {
+    this.startingSnapshotId = snapshotId;
+    return this;
+  }
+
+  @Override
+  protected void validate(TableMetadata base) {
+    if (replacedDataFiles.size() > 0) {
+      // if there are replaced data files, there cannot be any new row-level deletes for those data files
+      validateNoNewDeletesForDataFiles(base, startingSnapshotId, replacedDataFiles);

Review comment:
       @openinx, I opened a draft PR, https://github.com/apache/iceberg/pull/2892, to fix this, but I'm not sure that we actually want to merge it.
   
   Originally, I copied the `validateDataFilesExist` implementation from `RowDelta` and added tests (see the sketch below). That works well and is a fairly simple commit if we want it. But as I was testing it, I added cases for delete, overwrite, rewrite, and a transactional delete-and-append used as an overwrite.
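   For reference, here is a minimal sketch of that first approach, assuming a `validateDataFilesExist` helper with the same shape as the `RowDelta` version (the exact signature in the draft may differ):

```java
// Sketch only: mirrors the RowDelta-style existence check; the helper
// signature is assumed, not copied from the draft PR.
@Override
protected void validate(TableMetadata base) {
  if (replacedDataFiles.size() > 0) {
    // fail if any replaced data file was removed by a concurrent commit
    validateDataFilesExist(base, startingSnapshotId, replacedDataFiles);
    // fail if a concurrent commit added row-level deletes for those files
    validateNoNewDeletesForDataFiles(base, startingSnapshotId, replacedDataFiles);
  }
}
```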
   
   Then I realized that not all of those cases apply. For example, if I delete a file that has row deltas, that's perfectly okay: the operation deletes all rows that weren't already deleted, and any dangling positional deletes will simply never be applied. Similarly, an overwrite from, say, a MERGE INTO operation that replaces a file must first read that file and then build its replacement. The read must merge the existing deletes, so those rows are already removed from the replacement file. The deletes in that case are also fine to leave dangling, even if they are rewritten into a new file. The hypothetical sequence below makes the concurrent-delete case concrete.
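   To be concrete, here is a hypothetical sequence using `validateFromSnapshot` and the new `rewriteFiles` overload from this PR; `fileA`, `fileADeletes`, and `newDeletes` are placeholders, not files from the PR or its tests:

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DeleteFile;
import org.apache.iceberg.RewriteFiles;
import org.apache.iceberg.Table;
import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;

class DanglingDeleteExample {
  // fileA, fileADeletes, and newDeletes stand in for files that would come
  // from a real table scan.
  static void concurrentDeleteIsSafe(Table table, DataFile fileA,
      DeleteFile fileADeletes, DeleteFile newDeletes) {
    // 1. Plan a delete-file rewrite from the current snapshot that compacts
    //    fileADeletes into newDeletes.
    RewriteFiles rewrite = table.newRewrite()
        .validateFromSnapshot(table.currentSnapshot().snapshotId())
        .rewriteFiles(ImmutableSet.of(), ImmutableSet.of(fileADeletes),
            ImmutableSet.of(), ImmutableSet.of(newDeletes));

    // 2. A concurrent DELETE removes fileA entirely. fileADeletes now
    //    dangles, but dangling position deletes are never applied.
    table.newDelete().deleteFile(fileA).commit();

    // 3. The rewrite can still commit safely; no validation failure is
    //    needed for this case.
    rewrite.commit();
  }
}
```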
   
   I ended up leaving a validation that the data files are not rewritten (rough sketch below), but I also think that may not be necessary, because I don't think it would be reasonable to rewrite a data file without also applying the deletes. If we build a compaction operation, we will probably apply deletes during compaction. We could also build one that compacts just data files, but for that operation to be correct it would also need to either rewrite the delete files that point to the compacted data files, or use an older sequence number with equality deletes. If it rewrote delete files, then the delete rewrite would fail. If it changed sequence numbers, then we would have a lot more validation to do when building that plan.
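   For illustration, the check I kept could look roughly like the following; `replacedDeleteFiles`, `referencedDataFiles` (the data files that the rewritten delete files point to), and the `skipDeletes` flag are illustrative names, not the exact draft code:

```java
// Illustrative only: tolerate a concurrent DELETE or OVERWRITE of the
// referenced data files (those commits already account for the row-level
// deletes), but fail if one of them was rewritten, e.g. by compaction.
@Override
protected void validate(TableMetadata base) {
  if (!replacedDeleteFiles.isEmpty()) {
    validateDataFilesExist(base, startingSnapshotId, referencedDataFiles,
        true /* skipDeletes: ignore removals from delete operations */);
  }
}
```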
   
   After thinking about it, I don't think that we actually need to handle any 
cases where a compaction and delete file rewrite are concurrent, at least not 
from the delete file rewrite side. Any compaction should handle deletes as part 
of the compaction.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


