[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #4870: API: Add a scan for changes

GitBox Tue, 28 Jun 2022 09:34:14 -0700


aokolnychyi commented on code in PR #4870:
URL: https://github.com/apache/iceberg/pull/4870#discussion_r908697632



##########
api/src/main/java/org/apache/iceberg/DeletedRowsScanTask.java:
##########
@@ -0,0 +1,30 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+/**
+ * A scan task for deleted data records generated by adding delete files to the table.
+ */
+public interface DeletedRowsScanTask extends ChangelogScanTask {

Review Comment:
   I want to give specific examples for cases I mentioned 
[earlier](https://github.com/apache/iceberg/pull/4870#discussion_r891688733).
   
   **Concurrent merge-on-read DELETEs in Spark**
   
   We have `data_file_A`:
   
   ```
   1, "a" (pos 0)
   2, "b" (pos 1)
   3, "c" (pos 2)
   4, "d" (pos 3)
   ```
   
   Suppose we have two concurrent DELETEs (`d1` and `d2`). The first DELETE 
removes the records at `pos 0` and `pos 2`. The second one concurrently removes 
`pos 0` and `pos 1`. We allow the second DELETE to commit because it is not in 
conflict.
   
   What should our changelog look like?
   
   Changelog for `d1`:
   
   ```
   deleted, 1, "a"
   deleted, 3, "c"
   ```
   
   Changelog for `d2`:
   
   ```
   deleted, 2, "b"
   ```
   
   I think `1, "a"` should only appear in `d1`, even though a delete file added 
in `d2` also refers to it.
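
To make the expected behavior concrete, here is a minimal sketch of the rule, assuming deletes are replayed in commit order and a position is reported only by the first delete that removes it. `PositionDeleteChangelog` and `changelog` are hypothetical names for illustration, not part of the Iceberg API or this PR.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not an Iceberg class.
public class PositionDeleteChangelog {

  // For each delete (in commit order), returns the positions it newly removed.
  // Positions already removed by an earlier delete are skipped.
  public static List<List<Integer>> changelog(List<List<Integer>> deletesInCommitOrder) {
    Set<Integer> alreadyDeleted = new HashSet<>();
    List<List<Integer>> result = new ArrayList<>();

    for (List<Integer> positions : deletesInCommitOrder) {
      List<Integer> newlyDeleted = new ArrayList<>();
      for (Integer pos : positions) {
        // Set.add returns false when an earlier delete already removed this position
        if (alreadyDeleted.add(pos)) {
          newlyDeleted.add(pos);
        }
      }
      result.add(newlyDeleted);
    }

    return result;
  }

  public static void main(String[] args) {
    // d1 removes pos 0 and pos 2; d2 removes pos 0 and pos 1
    List<List<Integer>> changes = changelog(List.of(List.of(0, 2), List.of(0, 1)));
    System.out.println(changes); // prints [[0, 2], [1]]
  }
}
```

With `data_file_A` above, `d1` reports positions 0 and 2 (`1, "a"` and `3, "c"`), while `d2` reports only position 1 (`2, "b"`).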
   
   **Equality deletes against the same data file**
   
   We have `data_file_A`:
   
   ```
   1, "hr" (pos 0)
   2, "sw" (pos 1)
   3, "hr" (pos 2)
   4, "sw" (pos 3)
   ```
   
   Suppose we have a GDPR delete `d1` that adds an equality delete for `1` and 
a concurrent equality delete `d2` that removes all records in the `hr` department.
   
   What should our changelog look like?
   
   Changelog for `d1`:
   
   ```
   deleted, 1, "hr"
   ```
   
   Changelog for `d2`:
   
   ```
   deleted, 3, "hr"
   ```
   
   I don't think outputting `1, "hr"` again in `d2` would be correct as that 
record wasn't live when `d2` committed.
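
The same rule can be sketched for equality deletes, assuming `d1` commits before `d2` and each delete is evaluated only against rows that are still live. `EqualityDeleteChangelog`, `Row`, and `changelog` are hypothetical names for illustration, not part of the Iceberg API or this PR.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper for illustration only; not an Iceberg class.
public class EqualityDeleteChangelog {

  // Simplified record with an id and a department column.
  public static final class Row {
    public final int id;
    public final String dept;

    public Row(int id, String dept) {
      this.id = id;
      this.dept = dept;
    }

    @Override
    public String toString() {
      return id + ", \"" + dept + "\"";
    }
  }

  // For each equality delete (in commit order), returns the rows it removed
  // that were still live when it was applied.
  public static List<List<Row>> changelog(
      List<Row> rows, List<Predicate<Row>> deletesInCommitOrder) {
    List<Row> live = new ArrayList<>(rows);
    List<List<Row>> result = new ArrayList<>();

    for (Predicate<Row> delete : deletesInCommitOrder) {
      List<Row> removed = new ArrayList<>();
      Iterator<Row> it = live.iterator();
      while (it.hasNext()) {
        Row row = it.next();
        if (delete.test(row)) {
          removed.add(row);
          it.remove(); // the row is no longer live for later deletes
        }
      }
      result.add(removed);
    }

    return result;
  }

  public static void main(String[] args) {
    List<Row> dataFileA = List.of(
        new Row(1, "hr"), new Row(2, "sw"), new Row(3, "hr"), new Row(4, "sw"));

    Predicate<Row> d1 = row -> row.id == 1;           // GDPR delete for id 1
    Predicate<Row> d2 = row -> row.dept.equals("hr"); // delete the hr department

    System.out.println(changelog(dataFileA, List.of(d1, d2)));
    // prints [[1, "hr"], [3, "hr"]]
  }
}
```

Although the predicate for `d2` matches `1, "hr"`, that row was already removed by `d1`, so `d2` reports only `3, "hr"`.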



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


