[ 
https://issues.apache.org/jira/browse/GOBBLIN-1707?focusedWorklogId=811762&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-811762
 ]

ASF GitHub Bot logged work on GOBBLIN-1707:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 23/Sep/22 23:20
            Start Date: 23/Sep/22 23:20
    Worklog Time Spent: 10m 
      Work Description: phet commented on code in PR #3569:
URL: https://github.com/apache/gobblin/pull/3569#discussion_r979115309


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergTable.java:
##########
@@ -46,36 +51,89 @@
 public class IcebergTable {
   private final TableOperations tableOps;
 
+  /** @return metadata info limited to the most recent (current) snapshot */
   public IcebergSnapshotInfo getCurrentSnapshotInfo() throws IOException {
     TableMetadata current = tableOps.current();
-    Snapshot snapshot = current.currentSnapshot();
+    return createSnapshotInfo(current.currentSnapshot(), Optional.of(current.metadataFileLocation()));
+  }
+
+  /** @return metadata info for all known snapshots, ordered historically, with *most recent last* */
+  public Iterator<IcebergSnapshotInfo> getAllSnapshotInfosIterator() {
+    TableMetadata current = tableOps.current();
+    long currentSnapshotId = current.currentSnapshot().snapshotId();
+    List<Snapshot> snapshots = current.snapshots();
+    return Iterators.transform(snapshots.iterator(), snapshot -> {
+      try {
+        return IcebergTable.this.createSnapshotInfo(
+            snapshot,
+            currentSnapshotId == snapshot.snapshotId() ? Optional.of(current.metadataFileLocation()) : Optional.empty()
+        );
+      } catch (IOException e) {
+        throw new RuntimeException(e);
+      }
+    });
+  }
+
+  /**
+   * @return metadata info for all known snapshots, but incrementally, so content overlap between snapshots appears
+   * only within the first as they're ordered historically, with *most recent last*
+   */
+  public Iterator<IcebergSnapshotInfo> getIncrementalSnapshotInfosIterator() {

Review Comment:
   Presently (this PR included), we only support copying the entire Iceberg 
table.  I am already working on the next iteration, which calculates the delta 
between the src and dest.  (For integration testing that, we'll bootstrap the 
"destination" with a replicated copy made by the code in this PR.)
   
   I absolutely agree we do not want to eagerly load the Iceberg table's entire 
metadata; this interface is an `Iterator` for that reason.  I will 
`.reverse()` the order to enable walking backwards in history.  The delta 
comparison will look on the destination for each snapshot's manifest list file. 
 When one is found, we'll infer that replication up to that point was achieved.
   
   So yes, that is a plan I'm in the process of carrying out.  This commit is 
just to give us working code for an MVP that does only a full copy (not an 
incremental, delta copy).
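
   The backwards walk described above might look roughly like the sketch 
below.  It is only an illustration under stated assumptions, not the code 
planned for the follow-up PR: `DeltaWalkSketch`, `SnapshotRef`, 
`snapshotsToCopy`, and the `existsOnDest` predicate are hypothetical names, and 
the file paths are made up.  It walks the history from most recent to oldest 
and stops at the first snapshot whose manifest list file is already present on 
the destination.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.ListIterator;
import java.util.function.Predicate;

/** Hypothetical sketch; none of these names come from the PR. */
final class DeltaWalkSketch {

  /** Hypothetical stand-in for per-snapshot info: an ID plus its manifest list file path. */
  static final class SnapshotRef {
    final long snapshotId;
    final String manifestListPath;

    SnapshotRef(long snapshotId, String manifestListPath) {
      this.snapshotId = snapshotId;
      this.manifestListPath = manifestListPath;
    }
  }

  /**
   * Walks history backwards (most recent first) and stops at the first snapshot
   * whose manifest list file already exists on the destination.
   *
   * @param oldestFirst snapshots ordered historically, most recent last
   * @param existsOnDest assumed check for whether a path exists on the destination
   * @return snapshots still to copy, ordered historically, most recent last
   */
  static List<SnapshotRef> snapshotsToCopy(List<SnapshotRef> oldestFirst, Predicate<String> existsOnDest) {
    List<SnapshotRef> pending = new ArrayList<>();
    // Start the iterator past the last element so previous() yields the most recent snapshot first.
    ListIterator<SnapshotRef> it = oldestFirst.listIterator(oldestFirst.size());
    while (it.hasPrevious()) {
      SnapshotRef snapshot = it.previous();
      if (existsOnDest.test(snapshot.manifestListPath)) {
        break; // infer that replication up to (and including) this snapshot was achieved
      }
      pending.add(snapshot);
    }
    Collections.reverse(pending); // restore historical order for copying
    return pending;
  }

  public static void main(String[] args) {
    List<SnapshotRef> history = Arrays.asList(
        new SnapshotRef(1L, "/tbl/metadata/snap-1.avro"),
        new SnapshotRef(2L, "/tbl/metadata/snap-2.avro"),
        new SnapshotRef(3L, "/tbl/metadata/snap-3.avro"));
    // Pretend only the oldest snapshot's manifest list was replicated so far.
    Predicate<String> existsOnDest = "/tbl/metadata/snap-1.avro"::equals;
    for (SnapshotRef s : snapshotsToCopy(history, existsOnDest)) {
      System.out.println(s.snapshotId); // prints 2, then 3
    }
  }
}
```

   Because the walk stops at the first match, it touches only as much history 
as is missing on the destination, which is the reason for preferring a lazy 
`Iterator` over loading all metadata up front.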





Issue Time Tracking
-------------------

    Worklog Id:     (was: 811762)
    Time Spent: 1h 20m  (was: 1h 10m)

> Add Iceberg support to DistCp
> -----------------------------
>
>                 Key: GOBBLIN-1707
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1707
>             Project: Apache Gobblin
>          Issue Type: Task
>          Components: gobblin-core
>            Reporter: Kip Kohn
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Add capability for iceberg copy/replication to distcp.  Support incremental 
> copy (only of delta changes since last time) in addition to full copy on 
> first time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
