laskoviymishka commented on code in PR #985:
URL: https://github.com/apache/iceberg-go/pull/985#discussion_r3188889334


##########
manifest.go:
##########
@@ -463,6 +481,13 @@ type ManifestFile interface {
        HasAddedFiles() bool
        // HasExistingFiles returns true if ExistingDataFiles > 0 or if it was 
null.
        HasExistingFiles() bool
+       // Entries streams the manifest entries from the manifest file using
+       // the provided file system IO interface. Entries that have been
+       // marked as deleted are skipped if discardDeleted is true.
+       //
+       // Prefer Entries over FetchEntries when walking large manifests
+       // since it avoids loading every entry into memory at once.
+       Entries(fs iceio.IO, discardDeleted bool) iter.Seq2[ManifestEntry, 
error]

Review Comment:
   Adding `Entries` to the public `ManifestFile` interface is a breaking change 
for any external implementer or mock — they'll fail to compile against the next 
module bump. That's probably acceptable for a 0.x module, but I'd want it 
called out: at minimum a release-note entry, plus a `// Deprecated:` line on 
`FetchEntries` if the intent is to phase it out.
   
   If we'd rather keep the interface stable, `Entries` could live as a method 
on the concrete `*manifestFile` only, or as a free helper 
`iceberg.IterManifestEntries(m, fs, discardDeleted)`. What's the preferred 
direction here?



##########
manifest.go:
##########
@@ -339,8 +340,35 @@ func (m *manifestFile) FirstRowID() *int64 { return 
m.FirstRowIDValue }
 
 func (m *manifestFile) HasAddedFiles() bool    { return m.AddedFilesCount != 0 
}
 func (m *manifestFile) HasExistingFiles() bool { return m.ExistingFilesCount 
!= 0 }
-func (m *manifestFile) FetchEntries(fs iceio.IO, discardDeleted bool) 
([]ManifestEntry, error) {
-       return fetchManifestEntries(m, fs, discardDeleted)
+
+func (m *manifestFile) Entries(fs iceio.IO, discardDeleted bool) 
iter.Seq2[ManifestEntry, error] {
+       return func(yield func(ManifestEntry, error) bool) {
+               f, err := fs.Open(m.FilePath())
+               if err != nil {
+                       yield(nil, err)
+
+                       return
+               }
+               defer func() {
+                       _ = f.Close()

Review Comment:
   The streaming path drops Close errors here (`_ = f.Close()`), while 
`FetchEntries` below uses `internal.CheckedClose` to surface them via the named 
return. For object-store IO wrappers that flush metrics or surface deferred 
errors via Close, that's a real signal we'd lose as callers migrate from one to 
the other.
   
   Could we either yield the close error as a final synthetic `(nil, cerr)` 
pair, or at minimum call out the asymmetry in both godocs? Same applies to `_ = 
manifestReader.Close()` inside `IterManifest`.



##########
manifest.go:
##########
@@ -751,33 +776,54 @@ func (c *ManifestReader) ReadEntry() (ManifestEntry, 
error) {
        return tmp, nil
 }
 
+// IterManifest returns an iterator that streams manifest entries from
+// the provided reader without buffering them. If discardDeleted is true,
+// entries whose status is "deleted" are skipped.
+func IterManifest(m ManifestFile, f io.Reader, discardDeleted bool) 
iter.Seq2[ManifestEntry, error] {

Review Comment:
   `IterManifest` ends up with one in-tree caller (`ReadManifest`), and the 
method form `Entries` already owns the file lifecycle for everyone else. 
Exporting it locks in a third public way to read entries (alongside `Entries` 
and `FetchEntries`) without an obvious external use case — and the doc-comment 
doesn't pin down the contract (single-shot? error yields once and stops? close 
errors swallowed?).
   
   Could we lower-case to `iterManifest` unless there's a known consumer that 
needs the bare-`io.Reader` form? If it stays exported, worth spelling out those 
guarantees in the godoc.



##########
cmd/iceberg/output.go:
##########
@@ -150,12 +150,11 @@ func (t textOutput) Files(tbl *table.Table, history bool) 
{
                        snapshotTree = append(snapshotTree, 
pterm.LeveledListItem{
                                Level: 1, Text: "Manifest: " + m.FilePath(),
                        })
-                       datafiles, err := m.FetchEntries(afs, false)
-                       if err != nil {
-                               t.Error(err)
-                               os.Exit(1)
-                       }
-                       for _, e := range datafiles {
+                       for e, err := range m.Entries(afs, false) {
+                               if err != nil {
+                                       t.Error(err)
+                                       os.Exit(1)

Review Comment:
   With the slice-based `FetchEntries` the file was already closed before this 
branch ran. Now the file is open for the duration of iteration, so `os.Exit(1)` 
here skips the deferred `f.Close()` inside `Entries`. Cosmetic for a 
short-lived CLI but a new leak shape worth fixing while we're here — break out 
of the loop and exit after the range, or refactor `Files` to return an error.



##########
table/scanner.go:
##########
@@ -153,13 +153,12 @@ func GetPartitionRecord(dataFile iceberg.DataFile, 
partitionType *iceberg.Struct
 func openManifest(io io.IO, manifest iceberg.ManifestFile,
        partitionFilter, metricsEval func(iceberg.DataFile) (bool, error),
 ) ([]iceberg.ManifestEntry, error) {
-       entries, err := manifest.FetchEntries(io, true)
-       if err != nil {
-               return nil, err
-       }
+       var out []iceberg.ManifestEntry

Review Comment:
   We lost the `make([]iceberg.ManifestEntry, 0, len(entries))` preallocation 
here. `openManifest` is on the scan hot path, so for wide manifests this trades 
one fixed allocation for several `append` regrowths.
   
   Worth pre-sizing from the manifest counts — `int(manifest.AddedDataFiles()) 
+ int(manifest.ExistingDataFiles())` is a defensible upper bound since 
`discardDeleted=true`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to