The GitHub Actions job "License Binary Checker" on texera.git/main has failed.
Run started by GitHub user bobbai00 (triggered by bobbai00).

Head commit for run:
5e569568606a204070040b1521210cc9d853bc10 / Meng Wang <[email protected]>
fix: close CloseableIterable owners in Iceberg read paths (#5149)

### What changes were proposed in this PR?

Fixes a resource leak in `IcebergUtil.readDataFileAsIterator` and five
sibling sites in `IcebergDocument` that share the same anti-pattern:

```scala
closeableIterable.iterator().asScala
```

The bare `Iterator` returned to callers held no reference to its parent
`CloseableIterable`, so the parent could never be closed. Under
`S3FileIO`:

1. Every read leaked one `S3InputStream` (kept open until GC because
nothing in the call graph could close it).
2. The leaked stream had already borrowed one slot from the AWS SDK's
`ApacheHttpClient` connection pool (default **50**, which Texera did not
override).
3. After ~50 leaked reads the pool may saturate; new S3 reads then block
on `acquireConnection` until JVM restart.
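
The saturation mechanism in the steps above can be illustrated with a toy model. This is only a sketch: a `java.util.concurrent.Semaphore` stands in for the connection pool, the value 50 is taken from the default quoted above, and none of the names below come from the AWS SDK or from Texera.

```scala
import java.util.concurrent.Semaphore

// Stand-in for the HTTP connection pool: 50 permits model the 50 slots.
val pool = new Semaphore(50)

// Each leaked read borrows a slot and never returns it.
val leakedReads = (1 to 50).count(_ => pool.tryAcquire())

// The next read finds no free slot. A real client would block here;
// tryAcquire simply reports the failure instead of blocking.
val nextReadGetsSlot = pool.tryAcquire()
```

In this model the 51st read cannot obtain a slot, which corresponds to new S3 reads blocking once the leaked streams hold every connection.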

This PR:

- Introduces `CloseableScalaIterator[T]` (`Iterator[T] with
AutoCloseable`, idempotent `close()`) in `IcebergUtil`, which wraps a
`CloseableIterable[T]` and propagates `close()` to the parent.
- Changes `IcebergUtil.readDataFileAsIterator` to return
`CloseableScalaIterator[Record]` instead of bare `Iterator[Record]`.
Callers must now close it (e.g. via `Using.resource`).
- Updates the single caller in `IcebergDocument`'s read iterator to
track the close handle in a sibling `AutoCloseable` field
(`currentRecordIteratorCloser`) and close it on file-switch, on
exhaustion, and when the caller-imposed `until` cap is reached. The
sibling field is necessary because `Iterator.drop(n)` returns a bare
iterator that loses the wrapper type.
- Wraps the four eagerly-consumed `planFiles()` call sites — `getCount`,
`seekToUsableFile`, `getTableStatistics`, `asInputStream` — in
`Using.resource` so the metadata-side `CloseableIterable<FileScanTask>`
is closed promptly.
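
A minimal sketch of the wrapper idea described in the list above, not the PR's actual code. To stay self-contained it wraps any `java.lang.Iterable[T] with Closeable` rather than Iceberg's `CloseableIterable[T]`, and `CountingIterable` is a hypothetical test double that only counts `close()` calls.

```scala
import java.io.Closeable
import scala.jdk.CollectionConverters._
import scala.util.Using

// Sketch of an Iterator that keeps a reference to its parent so the
// parent can be closed. The parent type is an assumption standing in
// for Iceberg's CloseableIterable.
final class CloseableScalaIterator[T](parent: java.lang.Iterable[T] with Closeable)
    extends Iterator[T]
    with AutoCloseable {
  private val underlying: Iterator[T] = parent.iterator().asScala
  private var closed = false

  override def hasNext: Boolean = !closed && underlying.hasNext
  override def next(): T = underlying.next()

  // Idempotent: only the first call reaches the parent.
  override def close(): Unit =
    if (!closed) {
      closed = true
      parent.close()
    }
}

// Hypothetical parent that records how many times it was closed.
final class CountingIterable[T](items: List[T])
    extends java.lang.Iterable[T]
    with Closeable {
  var closeCount = 0
  override def iterator(): java.util.Iterator[T] = items.iterator.asJava
  override def close(): Unit = closeCount += 1
}

// Eager consumption: Using.resource closes the wrapper, and with it the parent.
val eagerParent = new CountingIterable(List(1, 2, 3))
val total = Using.resource(new CloseableScalaIterator(eagerParent)) { it =>
  it.foldLeft(0)(_ + _)
}

// Lazy consumption: drop(n) returns a bare Iterator without close(),
// so the close handle is kept in a sibling reference.
val lazyParent = new CountingIterable(List(1, 2, 3))
val wrapped = new CloseableScalaIterator(lazyParent)
val closer: AutoCloseable = wrapped
val tail: Iterator[Int] = wrapped.drop(1)
val rest = tail.toList
closer.close()
closer.close() // a second call is a no-op
```

The second half shows why a sibling handle is needed: `tail` has the static type `Iterator[Int]`, so only `closer` can still release the parent.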

**Known limitation (out of scope here):** if a caller of
`IcebergDocument.get()` / `getRange()` / `getAfter()` stops iterating
before `hasNext` returns `false` (e.g. throws mid-loop, or calls
`.take(n)` and then drops the result), the LAST file's
`CloseableScalaIterator` will leak until JVM GC. Fixing this requires
changing the public `Iterator[T]` return type on `VirtualDocument` to
`Iterator[T] with AutoCloseable` and updating all callers — best done as
a separate refactor.
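
The limitation can be pictured with a toy iterator that releases its resource only when `hasNext` first returns `false`. This is a hypothetical model for illustration; `ClosesOnExhaustion` is not a class in the PR.

```scala
// Hypothetical model: the resource is released only on exhaustion.
final class ClosesOnExhaustion[T](items: List[T]) extends Iterator[T] {
  private val underlying = items.iterator
  var closed = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more) closed = true // release happens only here
    more
  }
  override def next(): T = underlying.next()
}

// A caller that drains the iterator triggers the release.
val drained = new ClosesOnExhaustion(List(1, 2, 3))
drained.foreach(_ => ())

// A caller that stops early never reaches the releasing hasNext call.
val abandoned = new ClosesOnExhaustion(List(1, 2, 3))
val first = abandoned.next()
```

Because a plain `Iterator[T]` exposes no `close()`, the early-stopping caller has no way to release the resource itself, which is why the fix needs a change to the return type.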

### Any related issues, documentation, discussions?

Closes #5143.

### How was this PR tested?

- Added `IcebergUtilLeakSpec` (2 cases): validates that
`CloseableScalaIterator` (a) closes its parent `CloseableIterable` when
used inside `Using.resource`, and (b) is idempotent under repeated
`close()` calls.
- All existing iceberg specs still pass:
  - `IcebergUtilSpec`: 14/14
  - `IcebergUtilLeakSpec`: 2/2 (new)
  - `IcebergDocumentSpec`: 18/18 (exercises the modified read iterator's
    close-on-reassign / close-on-exhaustion paths against real Iceberg
    infrastructure)
  - `IcebergTableStatsSpec`: 12/12 (exercises `getTableStatistics` with
    the new `Using.resource` wrap)
  - `IcebergDocumentConsoleMessagesSpec`: 1/1

Run locally:

```
sbt "WorkflowCore/testOnly org.apache.texera.amber.util.IcebergUtilSpec org.apache.texera.amber.util.IcebergUtilLeakSpec org.apache.texera.amber.storage.result.iceberg.*"
```

Result: `47 succeeded, 0 failed`.

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>

Report URL: https://github.com/apache/texera/actions/runs/26331395326

With regards,
GitHub Actions via GitBox
