[ https://issues.apache.org/jira/browse/GOBBLIN-1707?focusedWorklogId=812178&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-812178 ]
ASF GitHub Bot logged work on GOBBLIN-1707:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 26/Sep/22 17:15
            Start Date: 26/Sep/22 17:15
    Worklog Time Spent: 10m
      Work Description: ZihanLi58 commented on code in PR #3569:
URL: https://github.com/apache/gobblin/pull/3569#discussion_r980310564


##########
gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergTable.java:
##########
@@ -46,36 +51,89 @@ public class IcebergTable {

   private final TableOperations tableOps;

+  /** @return metadata info limited to the most recent (current) snapshot */
   public IcebergSnapshotInfo getCurrentSnapshotInfo() throws IOException {
     TableMetadata current = tableOps.current();
-    Snapshot snapshot = current.currentSnapshot();
+    return createSnapshotInfo(current.currentSnapshot(), Optional.of(current.metadataFileLocation()));
+  }
+
+  /** @return metadata info for all known snapshots, ordered historically, with *most recent last* */
+  public Iterator<IcebergSnapshotInfo> getAllSnapshotInfosIterator() {
+    TableMetadata current = tableOps.current();
+    long currentSnapshotId = current.currentSnapshot().snapshotId();
+    List<Snapshot> snapshots = current.snapshots();
+    return Iterators.transform(snapshots.iterator(), snapshot -> {
+      try {
+        return IcebergTable.this.createSnapshotInfo(
+            snapshot,
+            currentSnapshotId == snapshot.snapshotId() ? Optional.of(current.metadataFileLocation()) : Optional.empty()
+        );
+      } catch (IOException e) {
+        throw new RuntimeException(e);
+      }
+    });
+  }
+
+  /**
+   * @return metadata info for all known snapshots, but incrementally, so content overlap between snapshots appears
+   * only within the first as they're ordered historically, with *most recent last*
+   */
+  public Iterator<IcebergSnapshotInfo> getIncrementalSnapshotInfosIterator() {

Review Comment:
   It makes sense to me if we are doing the MVP at this point.
   But if we just want to copy the whole table, why not use a set to store the files and eliminate duplicates directly, instead of this complicated logic? As for preserving the order, I believe most of the overhead I'm concerned about comes from the data files: even when copying just the current snapshot, you will check all the old data files. Compared to that, the metadata files do not seem to be a big deal here?



Issue Time Tracking
-------------------

    Worklog Id:     (was: 812178)
    Time Spent: 1h 50m  (was: 1h 40m)

> Add Iceberg support to DistCp
> -----------------------------
>
>                 Key: GOBBLIN-1707
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1707
>             Project: Apache Gobblin
>          Issue Type: Task
>          Components: gobblin-core
>            Reporter: Kip Kohn
>            Assignee: Abhishek Tiwari
>            Priority: Major
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Add capability for iceberg copy/replication to distcp. Support incremental
> copy (only of delta changes since last time) in addition to full copy on
> first time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
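
The set-based alternative raised in the review comment above could look roughly like the following. This is a minimal sketch, not Gobblin's implementation: `SnapshotPathDedup` and `dedupPaths` are hypothetical names, and each snapshot is modeled as a plain list of file path strings rather than as an `IcebergSnapshotInfo`. A `LinkedHashSet` is assumed so that first-seen (historical) order is kept while duplicates are dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not part of the Gobblin code base. */
public class SnapshotPathDedup {

  /**
   * Flattens per-snapshot path lists (oldest snapshot first) into one list
   * in which each path appears once, at the position it was first seen.
   */
  public static List<String> dedupPaths(List<List<String>> pathsBySnapshot) {
    // LinkedHashSet drops duplicates and remembers insertion order.
    Set<String> seen = new LinkedHashSet<>();
    for (List<String> paths : pathsBySnapshot) {
      seen.addAll(paths);
    }
    return new ArrayList<>(seen);
  }

  public static void main(String[] args) {
    // Two made-up snapshots that share one data file.
    List<List<String>> history = Arrays.asList(
        Arrays.asList("manifest-list-1.avro", "data-a.parquet"),
        Arrays.asList("manifest-list-2.avro", "data-a.parquet", "data-b.parquet"));
    System.out.println(dedupPaths(history));
  }
}
```

Note that the set holds every distinct path in memory at once, whereas an iterator over snapshots can hand results out one snapshot at a time.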