ShreyeshArangath opened a new issue, #3036:
URL: https://github.com/apache/iceberg-python/issues/3036

   ### Feature Request / Improvement
   
   `ArrowScan` in PyIceberg does not support true streaming, leading to OOM failures when a file is larger than the container's available memory. While the API returns an iterator, the implementation eagerly materializes all record batches for a `FileScanTask` before yielding the first row.
   
   Two primary bottlenecks were identified in the `pyiceberg.io.pyarrow` 
implementation:
   1. The internal scan logic uses a `list()` constructor on the batch iterator, forcing the entire file into memory.
   2. The `batch_size` parameter is not forwarded to the underlying PyArrow `ds.Scanner`, preventing granular memory control, though the scan does fall back to PyArrow's standard default batch size (see the sketch below).
   
   This behavior makes it impossible to process files larger than the available memory in distributed environments (e.g., Ray workers).
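   
   For illustration, here is a minimal sketch of the streaming behavior being requested, using only the public PyArrow dataset API. The function name `stream_batches` and its parameters are hypothetical; an actual fix would live inside the `pyiceberg.io.pyarrow` scan path rather than in user code:
   
   ```python
   import pyarrow.dataset as ds
   
   def stream_batches(parquet_path: str, batch_size: int = 10_000):
       """Hypothetical sketch: yield record batches lazily instead of
       materializing the whole file the way list(...) does."""
       dataset = ds.dataset(parquet_path, format="parquet")
       # Forwarding batch_size to the Scanner bounds per-batch memory,
       # addressing bottleneck 2 above.
       scanner = ds.Scanner.from_dataset(dataset, batch_size=batch_size)
       # to_batches() returns an iterator; yielding from it keeps peak
       # memory near one batch, addressing bottleneck 1.
       yield from scanner.to_batches()
   ```
   
   With this pattern, a consumer can process a file larger than memory one batch at a time, which is what a streaming `ArrowScan` would enable.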

