RussellSpitzer commented on code in PR #15656:
URL: https://github.com/apache/iceberg/pull/15656#discussion_r2943240445
##########
core/src/main/java/org/apache/iceberg/SnapshotChanges.java:
##########
@@ -178,13 +188,17 @@ private void cacheDeleteFileChanges() {
ImmutableList.Builder<DeleteFile> adds = ImmutableList.builder();
ImmutableList.Builder<DeleteFile> deletes = ImmutableList.builder();
- Iterable<ManifestFile> relevantDeleteManifests =
- Iterables.filter(
- snapshot.deleteManifests(io),
- manifest -> Objects.equals(manifest.snapshotId(),
snapshot.snapshotId()));
-
Iterable<CloseableIterable<Pair<ManifestEntry.Status, DeleteFile>>>
manifestReadTasks =
- Iterables.transform(relevantDeleteManifests, this::readDeleteManifest);
+ Iterables.concat(
Review Comment:
Techinically this is different than what SnapshotUtil was doing before.
newFilesBetween was doing this
```
ParallelIterable
├── Task 1: Snapshot A
│ ├── Read manifest list ─┐
│ ├── Read manifest file 1 │ all serial within task
│ └── Read manifest file 2 ─┘
├── Task 2: Snapshot B ├── parallel across tasks
│ ├── Read manifest list ─┐
│ └── Read manifest file 1 ─┘
└── Task 3: Snapshot C
├── Read manifest list ─┐
├── Read manifest file 1 │
└── Read manifest file 2 ─┘
```
Snapshot Changes does this
```
Sequential enumeration (manifest list reads)
├── Snapshot A: Read manifest list → manifest 1, manifest 2
├── Snapshot B: Read manifest list → manifest 1
└── Snapshot C: Read manifest list → manifest 1, manifest 2
ParallelIterable (manifest file reads only)
├── Task 1: Read manifest A-1
├── Task 2: Read manifest A-2
├── Task 3: Read manifest B-1 ├── parallel across tasks
├── Task 4: Read manifest C-1
└── Task 5: Read manifest C-2
```
So instead of reading each manifest list in it's own thread, we will start
pulling from the pool of manifests into a buffer and open each manifest list as
soon as we have an open thread and no manifests left in the previous manifest
list.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]