[ 
https://issues.apache.org/jira/browse/GOBBLIN-1707?focusedWorklogId=812178&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-812178
 ]

ASF GitHub Bot logged work on GOBBLIN-1707:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 26/Sep/22 17:15
            Start Date: 26/Sep/22 17:15
    Worklog Time Spent: 10m 
      Work Description: ZihanLi58 commented on code in PR #3569:
URL: https://github.com/apache/gobblin/pull/3569#discussion_r980310564


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergTable.java:
##########
@@ -46,36 +51,89 @@
 public class IcebergTable {
   private final TableOperations tableOps;
 
+  /** @return metadata info limited to the most recent (current) snapshot */
   public IcebergSnapshotInfo getCurrentSnapshotInfo() throws IOException {
     TableMetadata current = tableOps.current();
-    Snapshot snapshot = current.currentSnapshot();
+    return createSnapshotInfo(current.currentSnapshot(), Optional.of(current.metadataFileLocation()));
+  }
+
+  /** @return metadata info for all known snapshots, ordered historically, with *most recent last* */
+  public Iterator<IcebergSnapshotInfo> getAllSnapshotInfosIterator() {
+    TableMetadata current = tableOps.current();
+    long currentSnapshotId = current.currentSnapshot().snapshotId();
+    List<Snapshot> snapshots = current.snapshots();
+    return Iterators.transform(snapshots.iterator(), snapshot -> {
+      try {
+        return IcebergTable.this.createSnapshotInfo(
+            snapshot,
+            currentSnapshotId == snapshot.snapshotId() ? Optional.of(current.metadataFileLocation()) : Optional.empty()
+        );
+      } catch (IOException e) {
+        throw new RuntimeException(e);
+      }
+    });
+  }
+
+  /**
+   * @return metadata info for all known snapshots, but incrementally, so content overlap between snapshots appears
+   * only within the first as they're ordered historically, with *most recent last*
+   */
+  public Iterator<IcebergSnapshotInfo> getIncrementalSnapshotInfosIterator() {

Review Comment:
   This makes sense to me if we are building the MVP at this point. But if we
just want to copy the whole table, why not use a set to store the paths and
eliminate duplicates directly, instead of this more complicated logic? As for
preserving the order, I believe most of the overhead I'm concerned about comes
from the data files: even when copying just the current snapshot, you will
check all the old data files. Compared to that, the metadata files do not seem
to be a big deal here?
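
   A minimal sketch of the set-based alternative suggested above, for
illustration only. It assumes each snapshot's file paths are already available
as plain strings, ordered oldest snapshot first; `SnapshotPathDedup` and
`dedupPaths` are hypothetical names, not part of this PR or of the Iceberg
API. A `LinkedHashSet` drops repeated paths while keeping first-seen order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not the IcebergTable code under review.
public class SnapshotPathDedup {

  /**
   * Flattens per-snapshot path lists (assumed ordered oldest snapshot first)
   * into one list in which every path appears exactly once, at the position
   * where it was first seen.
   */
  public static List<String> dedupPaths(List<List<String>> pathsBySnapshot) {
    // LinkedHashSet ignores repeated additions and preserves insertion order.
    Set<String> seen = new LinkedHashSet<>();
    for (List<String> paths : pathsBySnapshot) {
      seen.addAll(paths);
    }
    return new ArrayList<>(seen);
  }

  public static void main(String[] args) {
    // Made-up paths: the second snapshot reuses a data file from the first.
    List<List<String>> pathsBySnapshot = Arrays.asList(
        Arrays.asList("/data/a.parquet", "/meta/m1.avro"),
        Arrays.asList("/data/a.parquet", "/data/b.parquet", "/meta/m2.avro"));
    System.out.println(dedupPaths(pathsBySnapshot));
  }
}
```

   Unlike the incremental iterator in the diff, this sketch holds every path
in memory at once and does not record which snapshot first introduced a file.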





Issue Time Tracking
-------------------

    Worklog Id:     (was: 812178)
    Time Spent: 1h 50m  (was: 1h 40m)

> Add Iceberg support to DistCp
> -----------------------------
>
>                 Key: GOBBLIN-1707
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1707
>             Project: Apache Gobblin
>          Issue Type: Task
>          Components: gobblin-core
>            Reporter: Kip Kohn
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Add capability for iceberg copy/replication to distcp.  Support incremental 
> copy (only of delta changes since last time) in addition to full copy on 
> first time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
