mengw15 opened a new pull request, #5149:
URL: https://github.com/apache/texera/pull/5149

   ### What changes were proposed in this PR?
   
   Fixes a resource leak in `IcebergUtil.readDataFileAsIterator` and five 
sibling sites in `IcebergDocument` that share the same anti-pattern:
   
   ```scala
   closeableIterable.iterator().asScala
   ```
   
   The bare `Iterator` returned to callers held no reference to its parent 
`CloseableIterable`, so the parent could never be closed. Under `S3FileIO`:
   
   1. Every read leaked one `S3InputStream` (kept open until GC because nothing 
in the call graph could close it).
   2. The leaked stream had already borrowed one slot from the AWS SDK's 
`ApacheHttpClient` connection pool (default **50**; texera did not override).
   3. After ~50 leaked reads the pool may saturate; new S3 reads then block on 
`acquireConnection` until JVM restart.
   
   This PR:
   
   - Introduces `CloseableScalaIterator[T]` (`Iterator[T] with AutoCloseable`, 
idempotent `close()`) in `IcebergUtil`, which wraps a `CloseableIterable[T]` 
and propagates `close()` to the parent.
   - Changes `IcebergUtil.readDataFileAsIterator` to return 
`CloseableScalaIterator[Record]` instead of bare `Iterator[Record]`. Callers 
must now close it (e.g. via `Using.resource`).
   - Updates the single caller in `IcebergDocument`'s read iterator to track 
the close handle in a sibling `AutoCloseable` field 
(`currentRecordIteratorCloser`) and close it on file-switch, on exhaustion, and 
on caller-imposed `until` cap. The sibling field is necessary because 
`Iterator.drop(n)` returns a bare iterator that loses the wrapper type.
   - Wraps the four eagerly-consumed `planFiles()` call sites — `getCount`, 
`seekToUsableFile`, `getTableStatistics`, `asInputStream` — in `Using.resource` 
so the metadata-side `CloseableIterable<FileScanTask>` is closed promptly.
   
   **Known limitation (out of scope here):** if a caller of 
`IcebergDocument.get()` / `getRange()` / `getAfter()` stops iterating before 
`hasNext` returns `false` (e.g. throws mid-loop, or calls `.take(n)` and then 
drops the result), the LAST file's `CloseableScalaIterator` will leak until JVM 
GC. Fixing this requires changing the public `Iterator[T]` return type on 
`VirtualDocument` to `Iterator[T] with AutoCloseable` and updating all callers 
— best done as a separate refactor.
   
   ### Any related issues, documentation, discussions?
   
   Closes #5143.
   
   ### How was this PR tested?
   
   - Added `IcebergUtilLeakSpec` (2 cases): validates that 
`CloseableScalaIterator` (a) closes its parent `CloseableIterable` when used 
inside `Using.resource`, and (b) is idempotent under repeated `close()` calls.
   - All existing iceberg specs still pass:
     - `IcebergUtilSpec`: 14/14
     - `IcebergUtilLeakSpec`: 2/2 (new)
     - `IcebergDocumentSpec`: 18/18 (exercises the modified read iterator's 
close-on-reassign / close-on-exhaustion paths against real Iceberg 
infrastructure)
     - `IcebergTableStatsSpec`: 12/12 (exercises `getTableStatistics` with the 
new `Using.resource` wrap)
     - `IcebergDocumentConsoleMessagesSpec`: 1/1
   
   Run locally:
   
   ```
   sbt "WorkflowCore/testOnly org.apache.texera.amber.util.IcebergUtilSpec 
org.apache.texera.amber.util.IcebergUtilLeakSpec 
org.apache.texera.amber.storage.result.iceberg.*"
   ```
   
   Result: `47 succeeded, 0 failed`.
   
   ### Was this PR authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (claude-opus-4-7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to