[GitHub] [iceberg] rdblue commented on a diff in pull request #4870: API: Add a scan for changes

GitBox Tue, 28 Jun 2022 16:34:30 -0700


rdblue commented on code in PR #4870:
URL: https://github.com/apache/iceberg/pull/4870#discussion_r909079955



##########
api/src/main/java/org/apache/iceberg/ChangelogScanTask.java:
##########
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+/**
+ * A changelog scan task.
+ */
+public interface ChangelogScanTask extends FileScanTask {
+  /**
+   * Returns the operation type (i.e. insert/delete).
+   */
+  ChangelogOperation operation();
+
+  /**
+   * Returns the relative commit order in which the changes must be applied.
+   */
+  int commitOrder();
+
+  /**
+   * Returns the snapshot ID in which the changes were committed.
+   */
+  long commitSnapshotId();

Review Comment:
   I may not be completely following, so correct me if I'm not understanding 
something.
   
   This interface appears to be written to contain changes from a single 
snapshot. For example, it could be reading deletes that were committed via a 
delete file that was added in snapshot `S`. In that case, `operation=DELETE`, 
`commitOrder=ord(S)`, and `commitSnapshotId=S`.
   
   If I understand this thread, the question is how to handle "net changes 
across snapshots", which I interpret as basically "squashing" snapshots to 
produce the output. If `S1` adds rows with ids 4, 5, and 6, then `S2` deletes 
the rows with ids 5 and 7, then the squashed result could be the rows with ids 
4 and 6, along with a delete for id 7.
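   
   As a small worked version of that example, here is a sketch of the squashed 
result computed with plain sets. The class and method names are hypothetical 
and are not part of the Iceberg API; row ids stand in for whole rows.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the Iceberg API.
class NetChangesExample {
  /** Rows inserted in the window that survive the deletes in the same window. */
  static Set<Integer> netInserts(Set<Integer> inserted, Set<Integer> deleted) {
    Set<Integer> result = new TreeSet<>(inserted);
    result.removeAll(deleted);
    return result;
  }

  /** Deletes that target rows that were not inserted inside the window. */
  static Set<Integer> netDeletes(Set<Integer> inserted, Set<Integer> deleted) {
    Set<Integer> result = new TreeSet<>(deleted);
    result.removeAll(inserted);
    return result;
  }

  public static void main(String[] args) {
    Set<Integer> s1Inserts = new TreeSet<>(Arrays.asList(4, 5, 6));
    Set<Integer> s2Deletes = new TreeSet<>(Arrays.asList(5, 7));
    System.out.println("inserts: " + netInserts(s1Inserts, s2Deletes)); // [4, 6]
    System.out.println("deletes: " + netDeletes(s1Inserts, s2Deletes)); // [7]
  }
}
```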
   
   We can avoid the problem by always producing one task for `S1` with rows 4, 
5, and 6, and then another task (or set of tasks) for `S2` with the delete for 
ids 5 and 7, which is what the interface appears to be written for. But I think 
the point is that we can read data files normally and apply deletes to easily 
produce the squashed version, which could be quite valuable to end users.
   
   I think we can solve this by setting the commit snapshot and change order to 
the values for `S1`. The difference between the two cases is that the read of 
the inserted data applies the newer deletes in the window of snapshots. So that 
task will always produce values that were added in snapshot `S1`: rows 4 and 6, 
since the delete from `S2` removes id 5. Then a separate task would be created 
for the deletes from `S2` (with `S2` for its `commitSnapshotId`), which would 
not emit id 5 because that row's change order was not strictly less than 
`ord(S1)`. That task would produce a deleted row for id 7, using the delete 
task's snapshot ID. So I think this works in all cases. We just still need to 
separate the deletes into their own task.
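   
   To make the proposed split concrete, here is a rough sketch that models the 
two tasks with toy types. Everything in it is an assumption for illustration: 
`Change`, `insertTask`, and `deleteTask` are hypothetical and do not exist in 
Iceberg, the orders 0 and 1 and snapshot IDs 1 and 2 are made up for `S1` and 
`S2`, and "added inside the window" stands in for the change-order check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the proposal; these are not Iceberg classes.
class SquashedChangelogSketch {
  enum Op { INSERT, DELETE }

  /** One emitted row, tagged with the metadata of the task that produced it. */
  static final class Change {
    final Op op;
    final int rowId;
    final int commitOrder;
    final long commitSnapshotId;

    Change(Op op, int rowId, int commitOrder, long commitSnapshotId) {
      this.op = op;
      this.rowId = rowId;
      this.commitOrder = commitOrder;
      this.commitSnapshotId = commitSnapshotId;
    }

    @Override
    public String toString() {
      return op + "(id=" + rowId + ", order=" + commitOrder
          + ", snapshot=" + commitSnapshotId + ")";
    }
  }

  /** Insert task: reads rows added by one snapshot and applies newer deletes in the window. */
  static List<Change> insertTask(
      Set<Integer> added, Set<Integer> newerDeletes, int order, long snapshotId) {
    List<Change> out = new ArrayList<>();
    for (int id : new TreeSet<>(added)) {
      if (!newerDeletes.contains(id)) {
        out.add(new Change(Op.INSERT, id, order, snapshotId));
      }
    }
    return out;
  }

  /** Delete task: emits deletes only for rows that were not added inside the window. */
  static List<Change> deleteTask(
      Set<Integer> deletes, Set<Integer> addedInWindow, int order, long snapshotId) {
    List<Change> out = new ArrayList<>();
    for (int id : new TreeSet<>(deletes)) {
      if (!addedInWindow.contains(id)) {
        out.add(new Change(Op.DELETE, id, order, snapshotId));
      }
    }
    return out;
  }

  /** The S1/S2 example from the comment: S1 adds 4, 5, 6 and S2 deletes 5, 7. */
  static List<Change> example() {
    Set<Integer> s1Adds = new TreeSet<>(Arrays.asList(4, 5, 6));
    Set<Integer> s2Deletes = new TreeSet<>(Arrays.asList(5, 7));
    List<Change> changes = new ArrayList<>();
    changes.addAll(insertTask(s1Adds, s2Deletes, 0, 1L)); // S1: order 0, snapshot ID 1
    changes.addAll(deleteTask(s2Deletes, s1Adds, 1, 2L)); // S2: order 1, snapshot ID 2
    return changes;
  }

  public static void main(String[] args) {
    example().forEach(System.out::println);
  }
}
```

   In this toy model the insert task carries the metadata of `S1` and emits ids 
4 and 6, while the delete task carries the metadata of `S2` and emits only the 
delete for id 7.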



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

