laskoviymishka commented on code in PR #1118:
URL: https://github.com/apache/iceberg-go/pull/1118#discussion_r3288292110


##########
table/updates.go:
##########
@@ -506,6 +506,27 @@ func (u *removeSnapshotsUpdate) PostCommit(ctx 
context.Context, preTable *Table,
                }
        }
 
+       // Manifest paths kept alive by retained snapshots, plus their
+       // loaded manifest-file slices so the live-data-file pass below
+       // doesn't re-download each manifest list.
+       retainedManifests := make(map[string]struct{})
+       retainedSnapshotManifests := make(map[int64][]iceberg.ManifestFile)

Review Comment:
   the snapshot-id key here is never read — the only consumer below ranges `for 
_, mans := range retainedSnapshotManifests`. I'd drop the map and either 
flatten while we're collecting or keep a `[][]iceberg.ManifestFile` so the 
walkedRetained pass below isn't paying for a map keyed on a field nobody uses.
   
   while we're here, the inner loop already gives us each `man` once per 
`(snap, man)` pair — we could just dedupe by `man.FilePath()` during this pass 
and hand the second pass a flat `[]iceberg.ManifestFile` of unique manifests, 
which also removes the need for `walkedRetained` below. wdyt?



##########
table/updates.go:
##########
@@ -506,6 +506,27 @@ func (u *removeSnapshotsUpdate) PostCommit(ctx 
context.Context, preTable *Table,
                }
        }
 
+       // Manifest paths kept alive by retained snapshots, plus their

Review Comment:
   small nit: "doesn't re-download each manifest list" reads like the whole 
function skips manifest-list reads, but the expired loop below still calls 
`snap.Manifests(prefs)` per expired snapshot. I'd tighten to something like `// 
Preload retained manifest lists once so the live-data-file pass below can walk 
them without re-fetching.` to keep the scope honest.



##########
table/updates.go:
##########
@@ -529,14 +560,17 @@ func (u *removeSnapshotsUpdate) PostCommit(ctx 
context.Context, preTable *Table,
                }
        }
 
-       for _, snap := range postTable.Metadata().Snapshots() {
-               mans, err := snap.Manifests(prefs)
-               if err != nil {
-                       return err
-               }
-
+       // Keep files still referenced (non-DELETED) by retained manifests,
+       // including data files carried forward as EXISTING by manifest
+       // merges. Each retained manifest is walked at most once.
+       walkedRetained := make(map[string]struct{}, len(retainedManifests))

Review Comment:
   minor perf regression in the "nothing to delete" path: when `filesToDelete` 
is empty (no expired snapshots had unique manifests), this pass still opens 
every retained manifest and iterates entries to call `delete` on a no-op key. 
I'd guard the whole retained walk with `if len(filesToDelete) > 0`. Restores 
the short-circuit and keeps the PR's spirit of "open each manifest at most 
once, and only when it matters".



##########
table/updates.go:
##########
@@ -506,6 +506,27 @@ func (u *removeSnapshotsUpdate) PostCommit(ctx 
context.Context, preTable *Table,
                }
        }
 
+       // Manifest paths kept alive by retained snapshots, plus their
+       // loaded manifest-file slices so the live-data-file pass below
+       // doesn't re-download each manifest list.
+       retainedManifests := make(map[string]struct{})
+       retainedSnapshotManifests := make(map[int64][]iceberg.ManifestFile)
+       for _, snap := range postTable.Metadata().Snapshots() {
+               mans, err := snap.Manifests(prefs)
+               if err != nil {
+                       return err
+               }
+               retainedSnapshotManifests[snap.SnapshotID] = mans
+               for _, man := range mans {
+                       retainedManifests[man.FilePath()] = struct{}{}
+               }
+       }
+
+       // Open each orphaned manifest at most once: skip manifests that
+       // retained snapshots still reference, and dedupe across expired
+       // snapshots that share manifests by reference.
+       visitedManifests := make(map[string]struct{})

Review Comment:
   the dedup invariant this PR is built on isn't covered by any test. 
`TestRemoveSnapshotsPostCommitDeletesStatisticsFiles` and 
`TestRemoveSnapshotsPostCommitPreservesStatisticsOfSurvivingSnapshots` both use 
snapshots with empty manifest lists, so `snap.Manifests()` returns `[]` and 
neither of the loops you're changing here ever executes.
   
   Adding two cases against a trackingIO-style fs would lock this in: (a) 
manifest M shared between an expired snapshot and a retained one, asserting M 
and its data files survive; (b) manifest M shared between two expired 
snapshots, asserting M+its data files are deleted exactly once and `Entries()` 
is called once. The second one also pins the perf claim.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to