ShreyeshArangath opened a new issue, #3036: URL: https://github.com/apache/iceberg-python/issues/3036
### Feature Request / Improvement

ArrowScan in PyIceberg does not support true streaming, leading to OOM failures when processing large files (file size > container memory). While the API returns an iterator, the implementation eagerly materializes all record batches for a FileScanTask before yielding the first row.

Two primary bottlenecks were identified in the `pyiceberg.io.pyarrow` implementation:

1. The internal scan logic calls `list()` on the batch iterator, forcing the entire file into memory.
2. The `batch_size` parameter is not forwarded to the underlying PyArrow `ds.Scanner`, so the scanner falls back to PyArrow's default batch size and granular memory control is not possible.

This behavior makes it impossible to process files larger than the available memory in distributed environments (e.g., Ray workers).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
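To make the first bottleneck concrete, here is a minimal, self-contained sketch of the eager pattern described above versus a streaming alternative. The names (`read_batches`, `scan_eager`, `scan_streaming`) are hypothetical stand-ins, not PyIceberg or PyArrow APIs; the point is only that wrapping the batch iterator in `list()` forces every batch to be resident before the first row is produced, while iterating lazily keeps peak memory proportional to one batch.

```python
def read_batches(n_batches, rows_per_batch):
    """Hypothetical per-file reader that lazily yields record batches
    (each batch modeled as a plain list of row values)."""
    for i in range(n_batches):
        yield list(range(i * rows_per_batch, (i + 1) * rows_per_batch))


def scan_eager(n_batches, rows_per_batch):
    # Mirrors the reported bottleneck: list() materializes ALL batches
    # in memory before the first row is yielded to the caller.
    batches = list(read_batches(n_batches, rows_per_batch))
    for batch in batches:
        yield from batch


def scan_streaming(n_batches, rows_per_batch):
    # Streaming variant: only one batch is resident at a time, so peak
    # memory is bounded by batch size rather than file size.
    for batch in read_batches(n_batches, rows_per_batch):
        yield from batch
```

Both functions produce identical rows; they differ only in when the underlying batches are pulled, which is exactly the property that matters for files larger than worker memory.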
