rdblue commented on a change in pull request #2865:
URL: https://github.com/apache/iceberg/pull/2865#discussion_r680229193



##########
File path: core/src/main/java/org/apache/iceberg/BaseRewriteFiles.java
##########
@@ -85,4 +90,18 @@ public RewriteFiles rewriteFiles(Set<DataFile> dataFilesToDelete, Set<DeleteFile
 
     return this;
   }
+
+  @Override
+  public RewriteFiles validateFromSnapshot(long snapshotId) {
+    this.startingSnapshotId = snapshotId;
+    return this;
+  }
+
+  @Override
+  protected void validate(TableMetadata base) {
+    if (replacedDataFiles.size() > 0) {
+      // if there are replaced data files, there cannot be any new row-level deletes for those data files
+      validateNoNewDeletesForDataFiles(base, startingSnapshotId, replacedDataFiles);

Review comment:
   > You mean people will need to know whether the data-files-only compaction will break the existing data set? This puts too much burden on users? I think we still need to provide the validation because we developers are sure that those cases will break the data correctness ...
   
   The contract of the `RewriteFiles` operation is that it does not change the 
table data. That is explicitly stated in the `rewriteFiles` configuration 
method's javadoc:
   
   > Add a rewrite that replaces one set of files with another set that 
**contains the same data**.
   
   I think that is a reasonable requirement for the operation. Compacting only 
data files without accounting for deletes would change the table data, even 
without a concurrent delete file rewrite. For example, if I have
   
   | `file-a.parquet` |
   | -- |
   | `1, 'x'` |
   | `2, 'y'` |
   
   | `file-b.parquet` |
   | -- |
   | `3, 'z'` |
   
   | `file-a-deletes.parquet` |
   | -- |
   | `'file-a.parquet', 0` |
   
   Compacting `file-a.parquet` and `file-b.parquet` together would undelete row 
`id=1`. There is no way for Iceberg to check whether the rewrite ignored 
`file-a-deletes.parquet` or applied the deletes. The correctness problem is 
entirely in the compaction operation.
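   
   To make that concrete, here is a minimal sketch of a data-files-only compaction commit through the `RewriteFiles` API (`Table.newRewrite()` returns the operation; the file handles are hypothetical stand-ins for the files above):
   
   ```java
   import java.util.Collections;
   import java.util.Set;
   import org.apache.iceberg.DataFile;
   import org.apache.iceberg.Table;
   
   // Minimal sketch, not the PR's code: replace file-a.parquet and file-b.parquet
   // with one compacted file. Iceberg cannot tell from this commit whether the
   // compacted file was written with file-a-deletes.parquet applied or ignored;
   // only the caller knows.
   public static void commitCompaction(Table table, Set<DataFile> rewritten, DataFile compacted) {
     table.newRewrite()
         .rewriteFiles(rewritten, Collections.singleton(compacted))
         .commit();
   }
   ```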
   
   We have to trust the `RewriteFiles` caller to do the right thing and 
preserve correctness.
   
   That means the caller must do one of the following:
   1. Remove the deleted row
   2. Rewrite `file-a-deletes.parquet` to point the delete at the new data file (see the sketch after this list)
   3. Write the new data file with a compatible sequence number (only valid if the deletes are equality deletes, so they would still apply)
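   
   For option 2, the commit could carry both the data file swap and the rewritten delete file in one operation, using the four-argument overload visible in the diff above (a sketch; all parameters are hypothetical handles supplied by the caller):
   
   ```java
   import java.util.Set;
   import org.apache.iceberg.DataFile;
   import org.apache.iceberg.DeleteFile;
   import org.apache.iceberg.Table;
   
   // Sketch of option 2: swap the data files and, in the same commit, replace
   // file-a-deletes.parquet with a delete file repositioned onto the new data file.
   public static void commitCompactionWithDeletes(Table table,
                                                  Set<DataFile> dataFilesToDelete,
                                                  Set<DeleteFile> deleteFilesToDelete,
                                                  Set<DataFile> dataFilesToAdd,
                                                  Set<DeleteFile> deleteFilesToAdd) {
     table.newRewrite()
         .rewriteFiles(dataFilesToDelete, deleteFilesToDelete, dataFilesToAdd, deleteFilesToAdd)
         .commit();
   }
   ```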
   
   If the caller uses option 1, then concurrent operations that rewrite the delete are unaffected: the delete becomes a dangling delete that no longer needs to be applied. If the caller uses option 2, then there is a new file tracking the delete that would not be part of a concurrent rewrite operation; a concurrent rewrite may produce a dangling delete.
   
   The only case we would need to worry about is option 3, and only if a concurrent delete rewrite converts an equality delete into a positional delete. To be careful, we could merge #2892 to be able to validate this case. But I'm skeptical that altering the sequence number of the compacted file is a good idea. It may not even be possible, if the sequence number of a delete that must still be applied falls between the sequence numbers of the starting data files.
   
   As I said, I'm okay not merging the validation for option 3 because I think 
it is unlikely that this is going to be a viable strategy. But I'm fine merging 
#2892 if you think it is needed.
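   
   For reference, the validation in this PR is opted into by passing the snapshot the rewrite was planned from; a sketch of how a compaction would wire it up, per `validateFromSnapshot` in the diff above (handles are hypothetical):
   
   ```java
   import java.util.Set;
   import org.apache.iceberg.DataFile;
   import org.apache.iceberg.Table;
   
   // Sketch: pass the snapshot the rewrite was planned from so that validate()
   // can check for new row-level deletes committed against the replaced data
   // files after that snapshot (validateNoNewDeletesForDataFiles in the diff).
   public static void commitValidated(Table table, long startingSnapshotId,
                                      Set<DataFile> replaced, Set<DataFile> added) {
     table.newRewrite()
         .validateFromSnapshot(startingSnapshotId)
         .rewriteFiles(replaced, added)
         .commit();
   }
   ```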






