mengw15 opened a new pull request, #5149:
URL: https://github.com/apache/texera/pull/5149
### What changes were proposed in this PR?
Fixes a resource leak in `IcebergUtil.readDataFileAsIterator` and five
sibling sites in `IcebergDocument` that share the same anti-pattern:
```scala
closeableIterable.iterator().asScala
```
The bare `Iterator` returned to callers held no reference to its parent
`CloseableIterable`, so the parent could never be closed. Under `S3FileIO`:
1. Every read leaked one `S3InputStream` (kept open until GC because nothing
in the call graph could close it).
2. The leaked stream had already borrowed one slot from the AWS SDK's
`ApacheHttpClient` connection pool (default **50**; texera did not override).
3. After ~50 leaked reads the pool may saturate; new S3 reads then block on
`acquireConnection` until JVM restart.
This PR:
- Introduces `CloseableScalaIterator[T]` (`Iterator[T] with AutoCloseable`,
idempotent `close()`) in `IcebergUtil`, which wraps a `CloseableIterable[T]`
and propagates `close()` to the parent.
- Changes `IcebergUtil.readDataFileAsIterator` to return
`CloseableScalaIterator[Record]` instead of bare `Iterator[Record]`. Callers
must now close it (e.g. via `Using.resource`).
- Updates the single caller in `IcebergDocument`'s read iterator to track
the close handle in a sibling `AutoCloseable` field
(`currentRecordIteratorCloser`) and close it on file-switch, on exhaustion, and
on caller-imposed `until` cap. The sibling field is necessary because
`Iterator.drop(n)` returns a bare iterator that loses the wrapper type.
- Wraps the four eagerly-consumed `planFiles()` call sites — `getCount`,
`seekToUsableFile`, `getTableStatistics`, `asInputStream` — in `Using.resource`
so the metadata-side `CloseableIterable<FileScanTask>` is closed promptly.
**Known limitation (out of scope here):** if a caller of
`IcebergDocument.get()` / `getRange()` / `getAfter()` stops iterating before
`hasNext` returns `false` (e.g. throws mid-loop, or calls `.take(n)` and then
drops the result), the LAST file's `CloseableScalaIterator` will leak until JVM
GC. Fixing this requires changing the public `Iterator[T]` return type on
`VirtualDocument` to `Iterator[T] with AutoCloseable` and updating all callers
— best done as a separate refactor.
### Any related issues, documentation, discussions?
Closes #5143.
### How was this PR tested?
- Added `IcebergUtilLeakSpec` (2 cases): validates that
`CloseableScalaIterator` (a) closes its parent `CloseableIterable` when used
inside `Using.resource`, and (b) is idempotent under repeated `close()` calls.
- All existing iceberg specs still pass:
- `IcebergUtilSpec`: 14/14
- `IcebergUtilLeakSpec`: 2/2 (new)
- `IcebergDocumentSpec`: 18/18 (exercises the modified read iterator's
close-on-reassign / close-on-exhaustion paths against real Iceberg
infrastructure)
- `IcebergTableStatsSpec`: 12/12 (exercises `getTableStatistics` with the
new `Using.resource` wrap)
- `IcebergDocumentConsoleMessagesSpec`: 1/1
Run locally:
```
sbt "WorkflowCore/testOnly org.apache.texera.amber.util.IcebergUtilSpec
org.apache.texera.amber.util.IcebergUtilLeakSpec
org.apache.texera.amber.storage.result.iceberg.*"
```
Result: `47 succeeded, 0 failed`.
### Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-opus-4-7)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]