RussellSpitzer commented on code in PR #15656:
URL: https://github.com/apache/iceberg/pull/15656#discussion_r2943240445


##########
core/src/main/java/org/apache/iceberg/SnapshotChanges.java:
##########
@@ -178,13 +188,17 @@ private void cacheDeleteFileChanges() {
     ImmutableList.Builder<DeleteFile> adds = ImmutableList.builder();
     ImmutableList.Builder<DeleteFile> deletes = ImmutableList.builder();
 
-    Iterable<ManifestFile> relevantDeleteManifests =
-        Iterables.filter(
-            snapshot.deleteManifests(io),
-            manifest -> Objects.equals(manifest.snapshotId(), 
snapshot.snapshotId()));
-
     Iterable<CloseableIterable<Pair<ManifestEntry.Status, DeleteFile>>> 
manifestReadTasks =
-        Iterables.transform(relevantDeleteManifests, this::readDeleteManifest);
+        Iterables.concat(

Review Comment:
   Techinically this is different than what SnapshotUtil was doing before.
   
   newFilesBetween was doing this
   
   ```
   ParallelIterable
   ├── Task 1: Snapshot A
   │   ├── Read manifest list          ─┐
   │   ├── Read manifest file 1         │ all serial within task
   │   └── Read manifest file 2        ─┘
   ├── Task 2: Snapshot B               ├── parallel across tasks
   │   ├── Read manifest list          ─┐
   │   └── Read manifest file 1        ─┘
   └── Task 3: Snapshot C
       ├── Read manifest list          ─┐
       ├── Read manifest file 1         │
       └── Read manifest file 2        ─┘
   ```
   
   Snapshot Changes does this
   ```
   Sequential enumeration (manifest list reads)
   ├── Snapshot A: Read manifest list → manifest 1, manifest 2
   ├── Snapshot B: Read manifest list → manifest 1
   └── Snapshot C: Read manifest list → manifest 1, manifest 2
   ParallelIterable (manifest file reads only)
   ├── Task 1: Read manifest A-1
   ├── Task 2: Read manifest A-2
   ├── Task 3: Read manifest B-1    ├── parallel across tasks
   ├── Task 4: Read manifest C-1
   └── Task 5: Read manifest C-2
   ```
   
   Now this means we aren't reading multiple manifest lists in parallel 
explicitly now, but the parallel iterable will trigger opening new snapshots 
once it has finished buffering manifest read tasks in parallel for existing 
snapshots. So it should be better.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to